Running PySpark on AWS EMR: need help removing stop words from a DataFrame

As mentioned above, I'm processing a 64 GB CSV file on an AWS EMR cluster from a Jupyter notebook. I concatenated my two columns into one, docum = concat(title, abstract). This is a sample of the data:

```
+--------------------+
|               docum|
+--------------------+
|Clinical features...|
|Nitric oxide: a p...|
|Surfactant protei...|
|Role of endotheli...|
|Gene expression i...|
+--------------------+
only showing top 5 rows
```
The data set is too large to post even one full document here, but I need help removing the stop words so I can run KMeans on this data.
I tried using gensim, but the module is not available in my PySpark environment. I also tried collecting the data into a
Python list, but the file was too large and I ran out of memory. This is the last step I did:

from pyspark.sql.functions import concat

df2 = df.select(concat(df.title, df.abstract))
df2 = df2.withColumnRenamed("concat(title, abstract)", "docum")

Now I just need to figure out how to remove the stop words so I can continue.

Thank you for your time.



Solution 1:[1]

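You can use the Spark ML feature transformers Tokenizer and StopWordsRemover for that. Both run on the DataFrame in a distributed way, so nothing has to be collected into a Python list: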

from pyspark.ml.feature import Tokenizer, StopWordsRemover

text = """
The data set is too large to even post a full document on here. But I need
 help removing the stopwords so I can run Kmeans on this data. 
I tried using the gensim but the module is not available on pyspark, I tried throwing it into a 
python list but it was too large of a file I ran out or memory. This is the last Step I did
"""
df = spark.createDataFrame([(1, text)], ["id", "text"])

# separate the text into words (Tokenizer lowercases and splits on whitespace)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)

# remove defined stop words
remover = StopWordsRemover(inputCol="words", outputCol="result", stopWords=["the", "a", "is", "it", "to"])
final_df = remover.transform(words_df).select("result")

# display() is a Databricks notebook helper; show() works in any PySpark session
final_df.show(truncate=False)


Links: StopWordsRemover

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Netanel Malka