NLP preprocessing: remove all words in a string that are not found in my list
I've made a list, important_words, and I have a dataframe with a column df['reviews'] that holds one string of review text per row (thousands of rows). I want to update 'reviews' by removing every word that is not in the important_words list, the opposite of stop-word removal, so that each review (row) in the df is left with only the important words.
Later in my starter code I also tokenize and normalize df['reviews']. Applying the filter to that column seems like it should be easier, since punctuation removal and lowercasing have already been applied. I'll try whichever method someone can share, thanks.
important_words = ['actor', 'action', 'awesome']
df['reviews'][1] = 'The actor, in the action movie was awesome'
df['reviews'][2] = 'The action movie was not good'
....
df['tokenized_normalized_reviews'][1] = ['the', 'actor', 'in', 'the', 'action', 'movie', 'was', 'awesome']
df['tokenized_normalized_reviews'][2] = ['the', 'action', 'movie', 'was', 'not', 'good']
I want:
df['review_important_words'][1] = 'actor, action, awesome'
df['review_important_words'][2] = 'action'
(either as a string, or computed from the tokenized column)
Solution 1:[1]
df['reviews'] = df['reviews'].apply(lambda x: ' '.join(word for word in x.split() if word in important_words))
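You can do it like this with pandas: apply runs the function on every element of the column. Note that split() only splits on whitespace and the comparison is case-sensitive, so a token such as 'actor,' keeps its comma and will not match 'actor'. Run this on text that has already had punctuation removed and been lowercased for it to behave as intended.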
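For raw review text that still contains punctuation and capital letters, here is a minimal sketch of one possible variant, not part of the original answer. It assumes a word is any run of word characters matched by the regex `\w+`, and the names `important_set`, `keep_important`, and `important_tokens` are made up for illustration. The two sample rows come from the question, but the frame here uses the default 0-based index rather than 1 and 2.

```python
import re
import pandas as pd

important_words = ["actor", "action", "awesome"]
# Hypothetical helper name: a set makes each membership check cheap.
important_set = set(important_words)

# Sample rows taken from the question.
df = pd.DataFrame({
    "reviews": [
        "The actor, in the action movie was awesome",
        "The action movie was not good",
    ]
})

def keep_important(text):
    # Lowercase, then extract runs of word characters so "actor," becomes "actor".
    tokens = re.findall(r"\w+", text.lower())
    # Keep only the important words, in their original order.
    return ", ".join(tok for tok in tokens if tok in important_set)

# String version: one comma-separated string per review.
df["review_important_words"] = df["reviews"].apply(keep_important)

# Tokenized version: assumes each row is a list of lowercase tokens.
# The column is rebuilt here only so the sketch is self-contained.
df["tokenized_normalized_reviews"] = df["reviews"].apply(
    lambda text: re.findall(r"\w+", text.lower())
)
df["important_tokens"] = df["tokenized_normalized_reviews"].apply(
    lambda tokens: [tok for tok in tokens if tok in important_set]
)
```

Writing to a new column keeps the original reviews available for comparison. If your own tokenizer differs from `\w+` (for example in how it treats apostrophes or hyphens), filter the column it produced instead of re-tokenizing.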
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
