Is there any way to get Python pip to see Amazon's pyspark module?
I have an Amazon EMR cluster with Spark, and I can run import pyspark to load the PySpark provided by EMR (from the spark-python RPM):
$ echo $PYTHONPATH
/usr/lib/spark/python/lib/py4j-0.10.9-src.zip:/usr/lib/spark/python:/usr/lib/spark/python/build:/usr/lib/spark/python/pyspark:/usr/lib/spark/python/lib/pyspark.zip
$ python
Python 3.7.10 (default, Jun 3 2021, 00:02:01)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> print(pyspark.__version__)
3.1.2+amzn.0
>>> print(pyspark.__path__)
['/usr/lib/spark/python/pyspark']
However, pip does not show the pyspark module:
$ python -m pip freeze | grep -i pyspark
python37-sagemaker-pyspark==1.4.1
$
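Presumably pip freeze only reports distributions that expose .dist-info/.egg-info metadata on sys.path; a rough cross-check of that metadata view (just a sketch, assuming setuptools/pkg_resources is available on this interpreter) is:
$ python -c "import pkg_resources; print(sorted(d.project_name for d in pkg_resources.working_set if 'pyspark' in d.project_name.lower()))"
which likewise lists only the SageMaker package, not the EMR pyspark sitting on PYTHONPATH.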
This is an issue because we are installing the delta-spark package (https://pypi.org/project/delta-spark/), which requires pyspark, so pip ends up downloading and installing the upstream pyspark from PyPI even though the Amazon EMR pyspark is already present.
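For illustration, a requirements file as small as this reproduces it (the version pin is just an example):
$ cat requirements.txt
delta-spark==1.0.0
$ python -m pip install -r requirements.txt    # also pulls pyspark down from PyPI as a dependency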
Why doesn't pip show the Amazon pyspark module? Is there any way to get it to see it?
(I know we could use pip's --no-deps to turn off dependency resolution, but there is no way to apply it to just one package in requirements.txt; and generally speaking we want dependency resolution.)
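For example, the per-invocation workaround would be something like the following, but it disables resolution for everything installed by that command, not just the pyspark requirement:
$ python -m pip install --no-deps delta-spark    # skips all dependency resolution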
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow