Running PySpark on AWS EMR: need help removing stop words from a DataFrame
As mentioned above, I'm processing a 64 GB CSV file on an AWS EMR cluster from a Jupyter notebook. I concatenated my two columns into one, docum = concat(title, abstract). This is a sample of the data:
```
+--------------------+
|               docum|
+--------------------+
|Clinical features...|
|Nitric oxide: a p...|
|Surfactant protei...|
|Role of endotheli...|
|Gene expression i...|
+--------------------+
only showing top 5 rows
```
The data set is too large to post even one full document here, but I need
help removing the stop words so I can run K-means on this data.
I tried using gensim, but the module is not available on PySpark. I also tried loading the text into a
Python list, but the file was too large and I ran out of memory. This is the last step I did:
from pyspark.sql.functions import concat
# concatenate title and abstract into a single column named "docum"
df2 = df.select(concat(df.title, df.abstract).alias("docum"))
Now I just need to figure out how to remove the stop words so I can continue.
Thank you for your time.
Solution 1:[1]
You can use the Spark ML feature transformers for that:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
text = """
The data set is too large to even post a full document on here. But I need
help removing the stopwords so I can run Kmeans on this data.
I tried using the gensim but the module is not available on pyspark, I tried throwing it into a
python list but it was too large of a file I ran out or memory. This is the last Step I did
"""
df = spark.createDataFrame([(1, text)], ["id", "text"])
# separate text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)
# remove defined stop words
remover = StopWordsRemover(inputCol="words", outputCol="result", stopWords=["the", "a", "is", "it", "to"])
final_df = remover.transform(words_df).select("result")
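# show() works in any PySpark session; in Databricks notebooks display(final_df) renders the same result
final_df.show(truncate=False)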
Links: StopWordsRemover
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Netanel Malka |

