PySpark Self Signed certificate to access Artifactory from inside an EMR Jupyter Notebook

I am attempting to use a PySpark kernel inside an EMR Notebook hosted on the AWS managed EMR service, and I am unable to access Artifactory to install PyPI packages. On the EMR server itself I have a PEM key for TLS/SSL, and /etc/pip.conf is set up to access Artifactory and point to the certificate. I verified this by SSH'ing into an edge node and running a pip install of SQLAlchemy from Artifactory. When I run a similar command inside an EMR Notebook using the PySpark kernel, in order to scope the installed library to the notebook itself, it fails with a self-signed certificate error.

The command I am using is:

sc.install_pypi_package("pandas","https://<ARTIFACTORY_DOMAIN>/artifactory/api/pypi/pypi/simple/pandas/")

The output:

Collecting pandas
  Could not fetch URL https://<ARTIFACTORY_DOMAIN>/artifactory/api/pypi/pypi/simple/pandas/pandas/: There was a problem confirming the ssl certificate: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1091) - skipping

How can I resolve this, or where should I start troubleshooting to find the cause?

Edit: I am also using Livy Impersonation and am unsure if Livy itself would need to be configured for SSL certs.



Solution 1:[1]

The issue was that our EMR cluster was set up so that the master node was the only instance with the /etc/pip.conf file. When a PySpark kernel is used in an EMR Studio Notebook, the task nodes are used to install Python packages with sc.install_pypi_package(). I therefore used a bootstrap script to write /etc/pip.conf to all nodes, and I was able to access Artifactory once that was in place. The SSL certificate was already on the task node, so the pip.conf file was all that was needed.
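As a rough illustration of what such a bootstrap step could write, here is a minimal Python sketch that generates a pip.conf with an index URL and a CA certificate path. The function name write_pip_conf and the certificate path shown in the usage comment are hypothetical and not taken from the original setup; index-url and cert are standard pip configuration options. Writing to /etc usually requires elevated privileges, so an actual bootstrap action would likely need to run this with sudo.

```python
import configparser
from pathlib import Path


def write_pip_conf(path, index_url, cert_path):
    """Write a minimal pip config that points pip at a private index.

    All three arguments are supplied by the caller; nothing here is
    specific to any particular cluster (hypothetical helper).
    """
    # interpolation=None so that '%' characters in a URL are kept literally.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {
        "index-url": index_url,  # repository root, e.g. the Artifactory "simple" index
        "cert": cert_path,       # CA bundle pip should trust for TLS verification
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w") as handle:
        config.write(handle)
    return target


# Example call (assumed values; adjust the certificate path to your cluster):
# write_pip_conf(
#     "/etc/pip.conf",
#     "https://<ARTIFACTORY_DOMAIN>/artifactory/api/pypi/pypi/simple",
#     "/path/to/ca-bundle.pem",
# )
```

Because a bootstrap action runs on every node as it is provisioned, the same file ends up on the master, core, and task nodes, which is the property that mattered here.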

To determine that I was on a task node I used the following code inside of a PySpark kernel notebook session. This matched with the internal IP address of a task node in our EMR cluster.

import socket
socket.gethostbyname(socket.gethostname())
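The same check can be extended to also report whether the pip configuration file exists on the node where the session runs. This is only a sketch: describe_node is a hypothetical helper name, and it assumes the configuration lives at /etc/pip.conf as described above.

```python
import os
import socket


def describe_node(pip_conf_path="/etc/pip.conf"):
    """Report where this code is running and whether pip.conf is present."""
    hostname = socket.gethostname()
    try:
        ip_address = socket.gethostbyname(hostname)
    except socket.gaierror:
        # The hostname may not resolve in every environment.
        ip_address = None
    return {
        "hostname": hostname,
        "ip_address": ip_address,
        "pip_conf_exists": os.path.isfile(pip_conf_path),
    }


print(describe_node())
```

If pip_conf_exists comes back False in the notebook session while the file is present on the master node, that points to the same situation described in this answer.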

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 rk92