'Inconsistent result output with gensim index_to_key

Good afternoon,

first of all thanks to all who take the time to read this. My problem is this, I would like to have a word2vec output the most common words.

I do this with the following command:

#how many words to print out ( sort at frequency)

x = list(model.wv.index_to_key[:2500])

Basically it works, but sometimes I get only 1948 or 2290 words printed out. I can't find any connection with the size of the original corpus (tokens, lines etc.) or deviation from the target value (if I increase the output value to e.g. 3500 it outputs 3207 words).

I would like to understand why this is the case, unfortunately I can't find anything on Google and therefore I don't know how to solve the problem. maybe by increasing the value and later deleting all rows after 2501 by using pandas



Solution 1:[1]

If any Python list ranged-access, like my_list[:n], returns less than n items, then the original list my_list had less than n items in it.

So, if model.wv.index_to_key[:2500] is only returning a list of length 1948, then I'm pretty sure if you check len(model.wv.index_to_key), you'll see the source list is only 1948 items long. And of course, you can't take the 1st 2500 items from a list that's only 1948 items long!

Why might your model have fewer unique words than you expect, or even that you counted via other methods?

Something might be amiss in your preprocessing/tokenization, but most likely is that you're not considering the effect of the default min_count=5 parameter. That default causes all words that appear fewer than 5 times to be ignored during training, as if they weren't even in the source texts.

You may be tempted to use min_count=1, to keep all words, but that's almost always a bad idea in word2vec training. Word2vec needs many subtly-contrasting alternate uses of a word to train a good word-vector.

Keeping words which only have one, or a few, usage examples winds up not failing to get good generalizable vectors for those rare words, but also interferes with the full learning of vectors for other nearby more-frequent words – now that their training has to fight the noise & extra model-cycles from the insufficiently-represented rare words.

Instead of lowering min_count, it's better to get more data, or live with a smaller final vocabulary.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 gojomo