sklearn/nltk - how to achieve stemming and configurable n-grams?

In the context of topic extraction, I can get either stemmed top words or n-grams, but not both at the same time. My code:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # stem every token produced by the default word analyzer
    return (stemmer.stem(w) for w in analyzer(doc))

vectorizer_s = CountVectorizer(max_df=0.95, max_features=1000,
                               analyzer=stemmed_words,
                               stop_words=my_stop_words,  # my custom stop list, defined elsewhere
                               ngram_range=(2, 4))

The output was correctly stemmed down to word roots, but every feature was a single word, despite my minimum n-gram size being two. After re-reading the sklearn documentation, I don't see any caveat that n-grams can't be used with custom analyzers.
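From what I can tell, passing a callable as analyzer makes the vectorizer hand the raw document straight to that callable and skip its own tokenize / stop-word / n-gram steps, which would explain the one-word output. A minimal sketch of hooking the stemmer in at the tokenizer stage instead, so that ngram_range and stop_words still apply (untested; my_stop_words is my own stop list from above):

import re

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
token_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default token pattern

def stemming_tokenizer(doc):
    # stem individual tokens; the vectorizer assembles n-grams afterwards
    return [stemmer.stem(t) for t in token_pattern.findall(doc)]

vectorizer_s = CountVectorizer(max_df=0.95, max_features=1000,
                               tokenizer=stemming_tokenizer,
                               stop_words=my_stop_words,  # assumed defined elsewhere
                               ngram_range=(2, 4))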

What's more, the output was full of noise words like 'the', 'and', and 'of', so it seems the stop-word filtering is discarded as well when I plug in the stemming analyzer (and with it any benefit of tf-idf weighting later on). When I remove the analyzer argument, I get the n-gram results but lose the stemming preprocessing.
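If the tokenizer route above is right, stop-word filtering would then run on already-stemmed tokens, so presumably the stop list itself needs stemming to keep matching (e.g. 'having' stems to 'have'):

stemmed_stop_words = [stemmer.stem(w) for w in my_stop_words]  # pass this as stop_words instead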

Question

Is there any way to reconcile stemming and the n-gram settings in the same vectorizer? Or should I stem everything much earlier in the workflow?
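For the second option, the simplest version I can picture is stemming the whole corpus once up front and vectorizing plain, already-stemmed text (a rough sketch; the whitespace split is cruder than sklearn's tokenizer, and corpus stands for my list of raw documents):

def stem_document(doc):
    # crude pre-stemming pass over one raw document
    return " ".join(stemmer.stem(w) for w in doc.split())

stemmed_corpus = [stem_document(d) for d in corpus]
vectorizer = CountVectorizer(max_df=0.95, max_features=1000,
                             stop_words=my_stop_words,
                             ngram_range=(2, 4))
X = vectorizer.fit_transform(stemmed_corpus)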


