Issue with "pandas on spark" used with conda: "No module named 'pyspark.pandas'" even though both pyspark and pandas are installed

I have installed both Spark 3.1.3 and Anaconda 4.12.0 on Ubuntu 20.04, and I have set PYSPARK_PYTHON to the Python binary of a conda environment called my_env:

export PYSPARK_PYTHON=~/anaconda3/envs/my_env/bin/python

I installed several packages in the conda environment my_env using pip. Here is a portion of the output of the pip freeze command:

numpy==1.22.3
pandas==1.4.1
py4j==0.10.9.3
pyarrow==7.0.0

N.B.: the pyspark package is not installed in the conda environment my_env. I would like to be able to launch a pyspark shell on different conda environments without having to reinstall pyspark in every environment (I would like to only modify PYSPARK_PYTHON). This would also avoid having different versions of Spark on different conda environments (which is sometimes desirable, but not always).

When I launch a pyspark shell using the pyspark command, I can indeed import pandas and numpy, which confirms that PYSPARK_PYTHON is properly set: my_env is the only conda environment with pandas and numpy installed, those packages are not present in any other Python installation (even outside conda), and if I change PYSPARK_PYTHON I am no longer able to import pandas or numpy.
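A quick way to double-check this from inside the pyspark shell (sc and spark are already defined by the shell):

# Run inside the pyspark shell: confirm which interpreter and Spark version are in use
import sys
print(sys.executable)   # points to ~/anaconda3/envs/my_env/bin/python
print(spark.version)    # prints 3.1.3 here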

Inside the pyspark shell, the following code works fine (creating and showing a toy Spark dataframe):

sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["a", "b"]).show()

However, if I try to convert the above DataFrame into a pandas-on-Spark DataFrame, it does not work. The command

sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["t", "a"]).to_pandas_on_spark()

returns:

AttributeError: 'DataFrame' object has no attribute 'to_pandas_on_spark'

I tried to first import pandas (which works fine) and then pyspark.pandas before running the above command, but when I run

import pyspark.pandas as ps

I obtain the following error:

ModuleNotFoundError: No module named 'pyspark.pandas'

Any idea why this happens?

Thanks in advance



Solution 1:[1]

From here, it seems that you need Apache Spark 3.2, not 3.1.3: the pandas API on Spark (pyspark.pandas, formerly the Koalas project) only ships with PySpark starting from 3.2.0, which is also when DataFrame.to_pandas_on_spark() was added. Update to 3.2 and you will have the desired API.
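For illustration, on Spark 3.2+ the snippet from the question should work as expected. A minimal sketch, assuming it is run inside the pyspark shell (where sc is already defined), using the same toy data as the question:

# Spark >= 3.2: the pandas API on Spark ships with pyspark itself
import pyspark.pandas as ps

sdf = sc.parallelize([(1, 2), (2, 4), (3, 5)]).toDF(["a", "b"])
psdf = sdf.to_pandas_on_spark()   # convert to a pandas-on-Spark DataFrame
print(type(psdf))                 # <class 'pyspark.pandas.frame.DataFrame'>
psdf.head()

If upgrading is not an option, the equivalent functionality for Spark 3.1.x is available as the separate Koalas package (databricks.koalas), which was later merged into PySpark as pyspark.pandas in 3.2.0.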

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Alexandru Placinta