Compute n-grams by category column with pandas
I'm trying to find the most frequent n-grams in a pandas column in Python. I put together the code below, which does exactly that.
However, I would like the results split by the "category" column. Instead of one row per n-gram with its total frequency (bi-gram|total frequency), like
"blue orange"|1
I would like three columns, bi-gram|frequency fruit|frequency meat, like
"blue orange"|1|0
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = {'text': ['blue orange is tired', 'an apple', 'meat are great for my stomach'],
        'category': ['fruit', 'fruit', 'meat']}
df = pd.DataFrame(data)

# Count word-level bi-grams and tri-grams in every row of the text column
word_vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['text'])

# Add up the per-row counts to get one total frequency per n-gram
frequencies = sum(sparse_matrix).toarray()[0]
df_ngrams = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
df_ngrams.sort_values('frequency', ascending=False).head(50)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
