org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) - PySpark and PLIP object
I'm trying to iterate through a PySpark DataFrame using a PySpark udf, but I get an error when the udf uses an object from the Python module I'm interested in (PLIP).
Here is a simple, reproducible example of the error:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from plip.structure.preparation import PDBComplex
spark = (
    SparkSession.builder
    .getOrCreate()
)
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
rdd = spark.sparkContext.parallelize(data)
example_df = rdd.toDF(columns)
example_df.show()
def example_function(language_col):
    protlig = PDBComplex()
    return language_col
udf_example_function = udf(example_function)
after_example_function_df = (
    example_df
    .withColumn("example_function_output",
                udf_example_function(col("language")))
)
after_example_function_df.show()
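For context on what the error message means: Spark serializes the udf on the driver and ships it to separate Python worker processes. If native code invoked inside the udf (for example, a C/C++ dependency of PLIP) segfaults or aborts, the worker dies without a Python traceback, and all Spark can report is "Python worker exited unexpectedly (crashed)". The sketch below simulates that failure mode with a plain subprocess; it is an illustration of the mechanism, not PLIP-specific code.

```python
import subprocess
import sys

# A PySpark udf runs in a separate Python worker process. If a native
# extension inside the udf crashes (segfault/abort), the worker is
# killed by a signal and Spark sees it "exit unexpectedly".
# os.abort() stands in for such a native crash here.
crashed = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    capture_output=True,
)
print("crashed worker return code:", crashed.returncode)  # nonzero; negative on Unix (killed by SIGABRT)

# A healthy worker for contrast: it exits normally with code 0.
healthy = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    capture_output=True,
)
print("healthy worker return code:", healthy.returncode)
```

This is why the Python-side stack trace is often missing in such crashes: the interpreter in the worker never gets a chance to raise an exception.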
If you want to reproduce the error, here are the requirements I used (arm64 architecture):
conda create --name plip_env
conda activate plip_env
conda install python=3.8.12
conda install matplotlib=3.5.1
conda install pandas=1.4.1
conda install mamba -n base -c conda-forge
mamba create -n opencadd opencadd
conda install -c conda-forge plip
conda install pyspark=3.2.1
Note that the return value here doesn't matter; I'm just trying to get the PLIP object working inside the udf.
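One thing worth ruling out with a conda setup like this: Spark launches its Python workers from the interpreter named by the `PYSPARK_PYTHON` environment variable (falling back to `python` on the PATH), which may not be the conda env where plip is installed. Pinning both the driver and the workers to the current interpreter, before creating the SparkSession, removes that variable from the equation. This is a general PySpark debugging step, not a confirmed fix for this particular crash.

```python
import os
import sys

# Pin Spark's driver and worker Python to the interpreter running this
# script (i.e. the activated conda env), so the workers see the same
# installed packages (plip and its native deps) as the driver.
# This must be set before SparkSession.builder.getOrCreate() is called.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"])
```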
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
