Use RDD to map dataframe rows into custom objects in PySpark
I want to convert each row of my dataframe into an instance of a Python class called Fruit.
I have a dataframe df with the following columns: Identifier, Name, Quantity
I also have a dictionary fruit_color that looks like this:
fruit_color = {"Apple": "Red", "Lemon": "yellow", ...}
The Fruit class has the following signature:
class Fruit(name: str, quantity: int, color: str, entryPointer: DataFrameEntry)
I also have a class called DataFrameEntry that takes a dataframe and an identifier as parameters:
class DataFrameEntry(df: DataFrame, index: int)
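In plain Python, these signatures roughly correspond to something like the following simplified sketch (dataclasses are used here purely for illustration):

from dataclasses import dataclass
from pyspark.sql import DataFrame

@dataclass
class DataFrameEntry:
    df: DataFrame  # the dataframe this entry points into
    index: int     # the row's Identifier

@dataclass
class Fruit:
    name: str
    quantity: int
    color: str
    entryPointer: DataFrameEntry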
Now I am trying to convert each row of the dataframe "df" into this object using RDDs, and ultimately to get a list of all fruits, with this piece of code:
df.rdd.map(lambda x: Fruit(
    x['Name'],
    x['Quantity'],
    fruit_color[x['Name']],
    DataFrameEntry(df, x['Identifier']))).collect()
However, I keep getting this error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o55.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
Maybe my approach is wrong? How can I convert each row of a dataframe to a specific object in PySpark in general?
Thank you a lot in advance!!
Solution 1:[1]
You need to make sure all objects and classes you're using inside map are defined inside map. To be more clear, RDD's map will distribute the workload across multiple workers (different machines), and those machines don't know what Fruit is.
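As a concrete illustration of that constraint, here is a minimal sketch under my own assumptions (it is not code from the original answer): map returns only plain, picklable values, and the Fruit objects are built on the driver after collect(). DataFrameEntry in this sketch stores just the row's Identifier rather than a DataFrame handle, because a pyspark DataFrame wraps a Py4J JavaObject that cannot be pickled, which is exactly what the __getstate__ error in the traceback is complaining about.

from dataclasses import dataclass
from pyspark.sql import SparkSession

@dataclass
class DataFrameEntry:
    index: int  # only the identifier is kept, not a DataFrame handle

@dataclass
class Fruit:
    name: str
    quantity: int
    color: str
    entryPointer: DataFrameEntry

fruit_color = {"Apple": "Red", "Lemon": "yellow"}

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Apple", 3), (2, "Lemon", 5)],
    ["Identifier", "Name", "Quantity"],
)

# The lambda only touches the Row and a plain dict, both of which pickle
# cleanly, so the closure can be shipped to the workers without errors.
collected = df.rdd.map(
    lambda x: (x["Name"], x["Quantity"], fruit_color.get(x["Name"]), x["Identifier"])
).collect()

# Build the rich objects on the driver, where Fruit and DataFrameEntry exist.
fruits = [
    Fruit(name, quantity, color, DataFrameEntry(index))
    for name, quantity, color, index in collected
]

Keeping the distributed part of the job limited to built-in types avoids any question of whether the workers can reconstruct the custom classes, and it also removes the unserializable DataFrame reference from the closure.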
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |