Analyzer ignoring certain word when used in Sklearn Tfidf

Here is my code:

import re

from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        return string
    else:
        return R

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

vectorizer = TfidfVectorizer(min_df = 0, token_pattern='(?u)\\b\\w+\\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)

print(vectorizer.vocabulary_)

The output of vocabulary is {'a': 0}.

I am confused about where "aa" and "aaa" went. If you check my ngrams function, it returns the string itself when its length is less than n (which is 4 in the code above).

The token regex is also written to accept single-character tokens.

Solution 1: [1]

This is a theory.

I believe TfidfVectorizer expects the analyzer function to return a sequence of tokens. Compare the inputs and outputs of your ngrams function:

'a'  -> 'a'
'aa' -> 'aa'
'aaa' -> 'aaa'
'aaaa' -> ['aaaa']
'aaaaa' -> ['aaaa','aaaa']

A string is itself a sequence, so in the first three cases you are returning a sequence of single characters. Every element of that sequence is the letter 'a', which would explain why 'a' is the only entry in the vocabulary.
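One way to see the effect without scikit-learn is to count tokens by iterating over whatever the analyzer returns. The `count_tokens` helper below is hypothetical. It only stands in for that iteration and is not scikit-learn's actual code.

```python
from collections import Counter

def count_tokens(analyzed):
    # Hypothetical stand-in: iterate over the analyzer's return value
    # and count each item as one token.
    return Counter(analyzed)

# A bare string is iterated character by character.
print(count_tokens('aaa'))    # Counter({'a': 3})

# A one-element list yields the whole string as a single token.
print(count_tokens(['aaa']))  # Counter({'aaa': 1})
```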

If my theory is correct, you need to replace

        return string

with

        return [string]
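For reference, here is a sketch of the whole function with that change applied, assuming the rest of the question's code stays as it is. The `docs` and `vocabulary` names are only for illustration. The set comprehension approximates the tokens a vectorizer would collect and does not call scikit-learn.

```python
import re

def ngrams(string, n=4):
    # Same cleanup as in the question: drop , - . / and " BD"
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]
    # Always return a list, so short strings stay whole tokens
    return grams if grams else [string]

docs = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
vocabulary = sorted({token for doc in docs for token in ngrams(doc)})
print(vocabulary)  # ['a', 'aa', 'aaa']
```

Passing this version as analyzer=ngrams should then give a vocabulary with separate entries for 'a', 'aa' and 'aaa'.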

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Tim Roberts
