Word2vec raise KeyError(f"Key '{key}' not present")

I'm currently using the gensim 4.0 library, and I don't know why it keeps failing to find similar words. At first, with min_count = 5, the error said that I need to build a vocabulary first. After I reduced it to min_count = 1, it raised a KeyError saying the key is not present. The full code with datasets is at https://github.com/JYjunyang/FYPDEMO and everything else works fine; only this word2vec part fails. Am I writing something wrong or missing some important steps? I'd appreciate any guidance. Note: LemmaColumn is a dataframe after lemmatization.

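import time
from gensim.models import Word2Vec

def FeaturesExtraction():
    # Start the timer before training so train_time measures the training itself.
    b1 = time.time()
    word2vec = Word2Vec(sentences=LemmaColumn, vector_size=100, window=5, min_count=1, workers=8, sg=1)
    train_time = time.time() - b1
    print(word2vec.wv.most_similar('virus', topn=10))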

I'm also not sure why, after training on 10k records, the vocabulary has only 7 unique words:
word #0/7 is t
word #1/7 is l
word #2/7 is x
word #3/7 is e
word #4/7 is _
word #5/7 is u
word #6/7 is f



Solution 1:[1]

Your LemmaColumn variable probably isn't in the format Word2Vec needs for the sentences argument. It needs a Python sequence: something that can be iterated over multiple times, like a list, or another re-iterable object. And in that sequence, every individual item must itself be a list-of-string-tokens (words).

Your tiny vocabulary is what I'd expect to see if you instead had:

LemmaColumn = [
    ['f', 'u', 'l', 'l', '_', 't', 'e', 'x', 't'],
]

…or even…

LemmaColumn = [
    'full_text',
]

…because Python will happily treat a plain string (like 'full_text') as if it were a list filled with 1-character strings. Thus your entire training vocabulary is only the characters of that single string, likely a column name rather than the column data you want to be using.
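The effect is easy to reproduce without gensim. The sketch below is a simplified stand-in for a vocabulary scan, not gensim's actual implementation; the helper name scan_vocab and the column name 'full_text' are assumptions made for illustration.

```python
from collections import Counter


def scan_vocab(sentences):
    """Count tokens the way a simple vocabulary scan would:
    iterate over sentences, then over the tokens of each sentence."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
    return counts


# Intended input: each sentence is a list of word tokens.
good_corpus = [["the", "virus", "spread"], ["virus", "cases", "rise"]]

# Accidental input: a single plain string (assumed here to be a column
# name, 'full_text'), which Python iterates character by character.
bad_corpus = ["full_text"]

good_vocab = scan_vocab(good_corpus)
bad_vocab = scan_vocab(bad_corpus)

print(sorted(good_vocab))  # whole words
print(sorted(bad_vocab))   # single characters only
print(len(bad_vocab))      # number of unique "words" found
```

With 'full_text' the scan finds exactly seven distinct characters (f, u, l, _, t, e, x), the same set as the vocabulary reported in the question. None of them occurs five or more times, which would be consistent with the empty-vocabulary error seen at min_count = 5.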

Double-check what's in LemmaColumn. Perform the necessary transformations on the column's data to make it the kind of sequence Word2Vec expects, & confirm it looks sensible before trying Word2Vec.
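One possible way to do that check and transformation is sketched below. The helper names to_token_lists and check_corpus, and the sample texts, are hypothetical; whitespace splitting is assumed to be good enough for already-lemmatized text.

```python
def to_token_lists(texts):
    """Convert an iterable of documents into a list of token lists.

    Strings are split on whitespace; items that are already iterables of
    tokens are copied into plain lists of str. Missing values (None or a
    float such as NaN, which pandas uses for empty cells) are skipped.
    """
    corpus = []
    for text in texts:
        if text is None or isinstance(text, float):
            continue
        if isinstance(text, str):
            tokens = text.split()
        else:
            tokens = [str(tok) for tok in text]
        if tokens:
            corpus.append(tokens)
    return corpus


def check_corpus(corpus):
    """Raise ValueError unless corpus is a list of lists of str tokens."""
    if not isinstance(corpus, list):
        raise ValueError("corpus must be a list of token lists")
    for i, sentence in enumerate(corpus):
        if not isinstance(sentence, list):
            raise ValueError(f"item {i} is not a list of tokens: {sentence!r}")
        if not all(isinstance(tok, str) for tok in sentence):
            raise ValueError(f"item {i} contains non-string tokens")
    return True


# Hypothetical lemmatized texts standing in for the real column data.
raw_texts = ["virus spread fast", None, "vaccine slow virus", float("nan")]
corpus = to_token_lists(raw_texts)
check_corpus(corpus)
print(corpus[:2])  # eyeball a sample before training
```

With pandas, the input would be one column's values (for example a hypothetical df['lemmatized']) rather than the DataFrame itself, since iterating over a DataFrame yields its column labels. The checked corpus is then what gets passed as sentences= to Word2Vec.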

Also: running with logging on to at least the INFO level will show a lot more of the model's progress. As you learn to understand the reported steps, problems will be evident sooner: weirdly-low counts of texts/words, or steps completing instantly when they'd take real time if they were working on the right amount (lots) of data.
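A minimal way to turn that logging on, using only Python's standard logging module (gensim reports through it). The format string is just one reasonable choice, and the logger name "gensim" is used here only to check that INFO messages would get through.

```python
import logging

# Route INFO-level (and more severe) messages to the console.
# force=True (Python 3.8+) replaces any handlers configured earlier,
# so this takes effect even if logging was already set up.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)

logger = logging.getLogger("gensim")
print(logger.isEnabledFor(logging.INFO))
```

Run this before constructing the model so the vocabulary-scan and training messages are shown from the start.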

Finally, note that min_count=1 is essentially always a bad idea with an algorithm like word2vec. Good vectors only come from multiple varied examples of the same word's usage – hence the default min_count=5. Keeping rare words not only tends to get poor vectors for those rare words, but the fact that natural-language text tends to have lots of such rare words means so much of the model's time & space is devoted to the (nearly hopeless) task of improving those junk words' vectors that other nearby words' vectors suffer as well.
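To get a feel for how many words a frequency cutoff discards, here is a rough sketch of the filtering idea only. It is not gensim's vocabulary-building code, which does more than this; the function name surviving_vocab and the toy corpus are made up for illustration.

```python
from collections import Counter


def surviving_vocab(corpus, min_count):
    """Return the words that occur at least min_count times in corpus."""
    counts = Counter(tok for sentence in corpus for tok in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus: 'virus' is frequent, every other word appears once.
toy_corpus = [
    ["virus", "spread", "fast"],
    ["virus", "mutation", "found"],
    ["new", "virus", "variant"],
    ["virus", "cases", "rise"],
    ["virus", "test", "positive"],
]

print(sorted(surviving_vocab(toy_corpus, min_count=1)))
print(sorted(surviving_vocab(toy_corpus, min_count=5)))
```

In this toy case min_count=1 keeps all eleven words, ten of which have a single usage example each, while min_count=5 keeps only 'virus'. On a real corpus the same comparison shows how much of the vocabulary consists of words too rare to train well.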

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 gojomo