Gensim - index_to_key gives numbers instead of words as output
I'm trying to get the 5 most frequent words from a word2vec model created from a Wikipedia dump. I turned this model into KeyedVectors and the code looks like this:
```python
from gensim.models import KeyedVectors

def vocabulary():
    model = KeyedVectors.load("vectors.kv")
    result = model.index_to_key[:5]
    print(result)
```
The result is: [(507, 1), (858, 1), (785, 1), (251, 1), (9807, 1)]
On the other hand, when I try the same with a model made from tokenized text, the result is: ['pdop', 'podany', 'wału', 'wytrzymałości', 'skręcanie']
Why am I getting numbers instead of words from the first model?
I used the .get_texts() function, but I can't split the result of this function and pass it as the sentences value to initialize the model, so I tried creating a model from one short article, and then training it with data from .get_texts() article by article, like this:
```python
wiki_txt = wiki.get_texts()
model = Word2Vec.load('wiki_w2v.model')
for i in wiki_txt:
    article = list(i)
    model.train(article, total_examples=1, epochs=1)
```
Even though at this point article is a list[str], the model still can't learn from the data.
Solution 1:[1]
In a KeyedVectors from a Gensim Word2Vec model, the index_to_key property is usually a plain Python list, where each entry is a string word.
That yours are not this suggests something atypical happened during training. Perhaps the corpus that was fed to the Word2Vec model was made up of lists-of-tuples, rather than lists-of-string-tokens?
You should go back to the code which initially created & saved that vectors.kv file, with special attention to the corpus that was used for training. (What was its first element? Did it look like a usual text-of-words – a list of strings – or something else?)
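As a quick sanity check, you can peek at the first item of whatever you passed as the training corpus and verify that every token is a string. This is a minimal sketch – `good_corpus` and `bad_corpus` are made-up stand-ins for your real data, and `is_word2vec_ready` is a hypothetical helper name:

```python
# Stand-ins for a training corpus: each item should be one text,
# i.e. a list of string tokens.
good_corpus = [["pdop", "podany", "wału"]]
bad_corpus = [[(507, 1), (858, 1)]]  # bag-of-words (id, count) tuples

def is_word2vec_ready(corpus):
    """True if the first text is a list of string tokens."""
    first_text = next(iter(corpus))
    return all(isinstance(token, str) for token in first_text)

print(is_word2vec_ready(good_corpus))  # True
print(is_word2vec_ready(bad_corpus))   # False – tuples, not words
```

If this check fails on your corpus, the vectors in vectors.kv were keyed by whatever those non-string items were, which would explain the tuples in index_to_key.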
Updates based on comments below:
It looks like you are providing a literal instance of WikiCorpus to Word2Vec as the sentences argument. That's not the sort of corpus – a sequence where each item is a list-of-words – that Word2Vec expects.
The iterator returned by wiki.get_texts() is closer – it's the right format, a sequence-of-lists-of-words – but it only passes over the dump once, whereas Word2Vec needs something it can re-iterate over for multiple passes.
I recommend one of these three options:
- Writing your own wrapper iterable class, which will call `wiki.get_texts()` again each time a new iteration is requested; or...
- Writing some code to iterate through `wiki.get_texts()` once, writing each text into a giant plain-text file, with one article per line. Then, run `Word2Vec` using that file as input; or...
- If you're on a giant-memory machine that can load the whole corpus as a list, you could do `corpus_list = list(wiki.get_texts())`, then pass that `corpus_list` as the training data to `Word2Vec`.
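The first option can be sketched like this. It's a minimal sketch, not a definitive implementation: `RestartableCorpus` is a made-up class name, and `fake_get_texts` below stands in for the real `wiki.get_texts` so the example is self-contained:

```python
class RestartableCorpus:
    """Re-iterable wrapper around a one-shot iterator factory.

    Word2Vec iterates over its corpus several times (a vocabulary
    survey plus one pass per epoch), so __iter__ asks the factory
    for a fresh iterator on every call instead of reusing one
    that's already been exhausted.
    """
    def __init__(self, make_texts):
        self.make_texts = make_texts  # e.g. wiki.get_texts

    def __iter__(self):
        return iter(self.make_texts())

# Demo stand-in for wiki.get_texts: a generator that would normally
# be exhausted after a single pass.
def fake_get_texts():
    for text in [["ala", "ma", "kota"], ["kot", "ma", "ale"]]:
        yield text

corpus = RestartableCorpus(fake_get_texts)
print([len(text) for text in corpus])  # first pass:  [3, 3]
print([len(text) for text in corpus])  # second pass: [3, 3] – still works
```

With the real dump you would pass `RestartableCorpus(wiki.get_texts)` as the sentences argument when constructing `Word2Vec`, instead of calling `.train()` article by article.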
Separately:
If your only goal is doing a Word2Vec training, you don't really need to do the long-running, expensive initial load of the WikiCorpus (`wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2')`) and then re-save it. That does a time-consuming vocabulary survey of the dump that isn't even consulted by later word2vec training, which does its own survey.
Instead, when you need to use the dump, you can just do:
wiki = WikiCorpus('plwiki-latest-pages-articles.xml.bz2', dictionary=dict())
That leaves the wiki object ready for operations like .get_texts(), without wastefully creating a redundant dictionary.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
