Use RDD to map dataframe rows into custom objects in PySpark
I want to convert each row of my dataframe into an instance of a Python class called Fruit.
I have a dataframe df with the following columns: Identifier, Name, Quantity
I also have a dictionary fruit_color that looks like this:
fruit_color = {"Apple": "Red", "Lemon": "yellow", ...}
The Fruit class has the following signature:
class Fruit(name: str, quantity: int, color: str, entryPointer: DataFrameEntry)
I also have a class called DataFrameEntry that takes a dataframe and an identifier as parameters:
class DataFrameEntry(df: DataFrame, index: int)
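In plain Python, these signatures roughly correspond to something like the following simplified sketch (dataclasses are used here purely for illustration):

from dataclasses import dataclass
from pyspark.sql import DataFrame

@dataclass
class DataFrameEntry:
    df: DataFrame  # the dataframe this entry points into
    index: int     # the row's Identifier

@dataclass
class Fruit:
    name: str
    quantity: int
    color: str
    entryPointer: DataFrameEntry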
Now I am trying to convert each row of the dataframe "df" into this object using RDDs, and ultimately to get a list of all fruits, with this piece of code:
df.rdd.map(lambda x: Fruit(
    x['Name'],
    x['Quantity'],
    fruit_color[x['Name']],
    DataFrameEntry(df, x['Identifier']))).collect()
However, I keep getting this error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o55.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
Maybe my approach is wrong? How can I convert each row of a dataframe to a specific object in PySpark in general?
Thank you a lot in advance!!
Solution 1:[1]
You need to make sure all objects and classes you're using inside map are defined inside map. To be more clear, RDD's map will distribute the workload across multiple workers (different machines), and those machines don't know what Fruit is.
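As a concrete illustration of that constraint, here is a minimal sketch under my own assumptions (it is not code from the original answer): map returns only plain, picklable values, and the Fruit objects are built on the driver after collect(). DataFrameEntry in this sketch stores just the row's Identifier rather than a DataFrame handle, because a pyspark DataFrame wraps a Py4J JavaObject that cannot be pickled, which is exactly what the __getstate__ error in the traceback is complaining about.

from dataclasses import dataclass
from pyspark.sql import SparkSession

@dataclass
class DataFrameEntry:
    index: int  # only the identifier is kept, not a DataFrame handle

@dataclass
class Fruit:
    name: str
    quantity: int
    color: str
    entryPointer: DataFrameEntry

fruit_color = {"Apple": "Red", "Lemon": "yellow"}

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Apple", 3), (2, "Lemon", 5)],
    ["Identifier", "Name", "Quantity"],
)

# The lambda only touches the Row and a plain dict, both of which pickle
# cleanly, so the closure can be shipped to the workers without errors.
collected = df.rdd.map(
    lambda x: (x["Name"], x["Quantity"], fruit_color.get(x["Name"]), x["Identifier"])
).collect()

# Build the rich objects on the driver, where Fruit and DataFrameEntry exist.
fruits = [
    Fruit(name, quantity, color, DataFrameEntry(index))
    for name, quantity, color, index in collected
]

Keeping the distributed part of the job limited to built-in types avoids any question of whether the workers can reconstruct the custom classes, and it also removes the unserializable DataFrame reference from the closure.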
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |