How to create a new column in Spark using Python, based on another column?

My DataFrame contains one column of strings. I want to create a new column based on part of the string in another column. For example:

         "content"                             "other column"
The father has two dogs                            father
One cat stay at home of my mother                  mother
etc.                                               etc.

My idea was to create a list of the words I'm interested in, for example: people = ["mother", "father", ...]

Then I iterate over the "content" column and extract the matching word to insert into the new column:

import pandas as pd

def extract_people(df):
    column = []
    people = ["mother", "father", ...]  # the words I'm interested in
    # collect() brings every row of "content" back to the driver,
    # then each row is scanned for the first matching word
    for row in df.select("content").collect():
        for word in people:
            if word in str(row):  # keep the first word that appears in the row
                column.append(word)
                break
    return pd.Series(column)


f_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))

This code doesn't work, and gives me this error on the collect():

22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space

Maybe this is because my file is too large: it has 15 million rows, and collect() brings them all back to the driver. How can I create the new column in a different way?
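
Something like the following sketch is what I'm imagining, assuming the matching can be expressed with Spark's built-in column functions (people and df_pyspark as above), so that nothing is collected on the driver:

from pyspark.sql import functions as F

people = ["mother", "father"]  # etc.

# One WHEN clause per word; coalesce() keeps the first non-null match,
# which mimics the break in my loop. All the work stays on the executors.
people_col = F.coalesce(
    *[F.when(F.col("content").contains(w), F.lit(w)) for w in people]
)

f_pyspark = df_pyspark.withColumn('people', people_col)

With this, rows where no word matches would simply get NULL in the "people" column.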


