Analyzer ignoring certain words when used in Sklearn Tfidf
Here is my code:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    # Strip punctuation and the ' BD' marker, then build character n-grams.
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        return string
    else:
        return R

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
vectorizer = TfidfVectorizer(min_df = 0, token_pattern='(?u)\\b\\w+\\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)
print(vectorizer.vocabulary_)
The output of vocabulary is {'a': 0}.
I am confused about where "aa" and "aaa" went. As you can see in my ngrams function, I return the string itself if its length is less than the parameter n (which is 4 in the code above).
The token regex is also written so that it accepts single characters.
Solution 1:[1]
This is a theory.
I believe TfidfVectorizer expects the analyzer function to return a sequence of tokens. Notice the inputs vs outputs of your ngrams function:
'a' -> 'a'
'aa' -> 'aa'
'aaa' -> 'aaa'
'aaaa' -> ['aaaa']
'aaaaa' -> ['aaaa','aaaa']
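A string is itself a sequence of characters, so in the first 3 cases you are returning a sequence that consists of repeats of the single letter 'a'. Since every entry in your list L falls into one of those 3 cases, the only token the vectorizer ever sees is 'a'.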
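To make that concrete, here is a minimal sketch that mimics only the iteration step. It assumes the vectorizer simply iterates over whatever the analyzer returns; `tokens_seen` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import re

def ngrams(string, n=4):
    # Analyzer from the question, behavior unchanged.
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = zip(*[string[i:] for i in range(n)])
    R = [''.join(g) for g in grams]
    if len(R) == 0:
        return string  # a bare str, not a list of tokens
    else:
        return R

def tokens_seen(doc):
    # Hypothetical stand-in for the vectorizer: iterate over the
    # analyzer's return value and collect each item as one token.
    return [token for token in ngrams(doc)]

for doc in ['a', 'aa', 'aaa', 'aaaa']:
    print(repr(doc), '->', tokens_seen(doc))
```

For the short inputs, iterating over the returned string yields one 'a' per character, which would explain why only 'a' ends up in the vocabulary.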
If my theory is correct, you need to replace
return string
with
return [string]
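Put together, the change might look like the sketch below. It does not call scikit-learn; instead it collects the distinct tokens the analyzer produces for your list, as a rough stand-in for how the vocabulary is built (an assumption on my part). The names `docs` and `distinct_tokens` are only for illustration.

```python
import re

def ngrams(string, n=4):
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = zip(*[string[i:] for i in range(n)])
    R = [''.join(g) for g in grams]
    if len(R) == 0:
        # Wrap the short string in a list so it counts as ONE token
        # instead of being iterated character by character.
        return [string]
    return R

docs = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

# Rough stand-in for vocabulary building: gather every distinct token.
distinct_tokens = sorted({tok for doc in docs for tok in ngrams(doc)})
print(distinct_tokens)
```

With the analyzer always returning a list, 'aa' and 'aaa' survive as whole tokens, so passing this version as analyzer=ngrams should give a vocabulary with three entries rather than one.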
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tim Roberts |
