sklearn/nltk - how to achieve stemming and configurable n-grams?
In the context of topic extraction, I can get either stemmed top words or n-grams, but not both at the same time. My code:
```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

# my_stop_words is a stop-word list defined earlier in my script
vectorizer_s = CountVectorizer(max_df=0.95, max_features=1000,
                               analyzer=stemmed_words,
                               stop_words=my_stop_words,
                               ngram_range=(2, 4))
```
The output was successfully stemmed down to word roots, but it was also all single words, despite my minimum n-gram setting being two. After re-reading the sklearn documentation, I don't see any caveat that n-grams can't be used with custom analyzers.
What's more, the output contained all kinds of noise words like 'the', 'and', and 'of' (so it's as though I even lose the tf-idf/stop-word handling when trying to add a stemming step). When I remove the analyzer argument, I get the n-gram results back, but lose the stemming preprocessing.
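From what I can tell, when `analyzer` is given a callable, scikit-learn hands the raw document straight to that callable and skips the rest of its pipeline, which would explain why `ngram_range` and `stop_words` are silently ignored. Here is a sketch of the workaround I am leaning towards: inject the stemmer through `tokenizer` instead, which runs before stop-word filtering and n-gram assembly (stemming `my_stop_words` as well is my own assumption, to keep the stop list consistent with the stemmed tokens):

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
tokenize = CountVectorizer().build_tokenizer()  # default regex word tokenizer

def stemmed_tokens(doc):
    # Stem token by token; stop-word filtering and n-gram assembly
    # still happen inside the vectorizer afterwards.
    return [stemmer.stem(t) for t in tokenize(doc)]

# Stem the stop list too, so it matches the stemmed tokens
# (my_stop_words is the same list defined elsewhere in my script).
stemmed_stop_words = [stemmer.stem(w) for w in my_stop_words]

vectorizer_s = CountVectorizer(max_df=0.95, max_features=1000,
                               tokenizer=stemmed_tokens,
                               stop_words=stemmed_stop_words,
                               ngram_range=(2, 4))
```

Since only tokenization is replaced, every other configurable (`max_df`, `max_features`, `ngram_range`) should still apply as documented.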
Question
Is there any way to reconcile stemming and the n-gram configurables in the same vectorizer? Or should I instead stem everything much earlier in the workflow?
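For the "stem earlier" route, here is a sketch of a subclass that splices stemming into the default tokenizer, so the standard pipeline stays intact (the class name `StemmedCountVectorizer` is just my placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

class StemmedCountVectorizer(CountVectorizer):
    def build_tokenizer(self):
        # Wrap the default tokenizer so each token is stemmed
        # before stop-word filtering and n-gram assembly run.
        tokenize = super().build_tokenizer()
        return lambda doc: [stemmer.stem(t) for t in tokenize(doc)]

# my_stop_words may also need stemming here for a fully consistent match
vectorizer_s = StemmedCountVectorizer(max_df=0.95, max_features=1000,
                                      stop_words=my_stop_words,
                                      ngram_range=(2, 4))
```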
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow