How to map words to feature importances using RandomForest and Word2Vec in PySpark
I know how to extract each word from the featureImportances array of a model trained on TF-IDF features, but I can't figure out how to do the same with a Word2Vec model.
I'm mapping the words to their importance scores with the approach below.

First, get the ML attributes from the dataset's metadata. This gives a list of dictionaries, each containing name (numeric or binary format) and index values:
```python
list_extract = []
for i in dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]:
    list_extract = list_extract + dataset.schema[featuresCol].metadata["ml_attr"]["attrs"][i]
```

(I merged both the numeric and binary fields.)
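For reference, a minimal sketch of the shape this metadata usually takes (the attribute names and index values here are hypothetical stand-ins, not real output):

```python
# Hypothetical stand-in for dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
attrs = {
    'numeric': [{'idx': 0, 'name': 'word_col_101'}],
    'binary':  [{'idx': 1, 'name': 'word_col_202'}],
}

# Merge the numeric and binary attribute lists, as described above
list_extract = []
for i in attrs:
    list_extract = list_extract + attrs[i]

print(list_extract)
# [{'idx': 0, 'name': 'word_col_101'}, {'idx': 1, 'name': 'word_col_202'}]
```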
Then iterate over the whole list:
```python
for word_dict in list_extract:
    wordName = word_dict['name']
    wordIndex = word_dict['idx']
```
The word_dict['name'] value gives me the word name in a format like 'inputCol_value'. I need to understand how to map this value back to the word vocabulary.

I can use this method with a TF-IDF model because the value part of the name contains the hashed value of the word; I tested it and it fits. For example, let's say I have a vocab in the format {'hashed_value1': word_name1, ...}.
Then I can get the importance score of each word like this:
```python
wordName = wordName.replace(inputCol + '_', '')  # strip the prefix to extract the value, as described above
name = vocab[int(wordName)]    # this gives me the name of the word
score = featureImp[wordIndex]  # this gives me the score of the corresponding word
```
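Putting the TF-IDF mapping steps together, here is a self-contained sketch with dummy data (the vocab entries, importance values, and metadata names are made up purely for illustration):

```python
# Dummy stand-ins for the real pipeline outputs
inputCol = 'word_col'
vocab = {101: 'apple', 202: 'banana'}    # hashed_value -> word
featureImp = [0.7, 0.3]                  # importances indexed by feature idx
list_extract = [
    {'name': 'word_col_101', 'idx': 0},
    {'name': 'word_col_202', 'idx': 1},
]

# Map each metadata entry to (word, importance score)
scores = {}
for word_dict in list_extract:
    wordName = word_dict['name'].replace(inputCol + '_', '')  # -> hashed value
    name = vocab[int(wordName)]                               # -> actual word
    scores[name] = featureImp[word_dict['idx']]

print(scores)  # {'apple': 0.7, 'banana': 0.3}
```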
My question is: how can I map words in a Word2Vec model? Its output is also in numeric format, but I can't do the same as I did with the TF-IDF model.
I'm creating the Word2Vec model like this:
```python
Word2Vec(inputCol='word_col', outputCol='vec_col', vectorSize=300, minCount=1)
```
With the same algorithm I used for the TF-IDF model, I get numeric values in wordName. Let's say I have a vocab with 30 words: if I set vectorSize=300, I get a value smaller than 300 for each word.
How can I map the words in this scenario? For example, it gave me 249 as wordName; my vocab size is 30, so it can't be an index or the name.
I tried setting the vectorSize parameter equal to the vocab size. It works, but I can't be sure I'm mapping correctly.
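To restate the mismatch I'm seeing with dummy numbers (illustrative only): Word2Vec's output vector has vectorSize components, so a downstream model sees vectorSize features regardless of how many words are in the vocabulary.

```python
# Illustrative numbers only, matching the situation described above
vectorSize = 300
vocab_size = 30

feature_indices = list(range(vectorSize))  # indices a downstream model sees

# An index like 249 is valid as a feature (vector-component) index...
assert 249 in feature_indices
# ...but it cannot index a 30-word vocab, which matches what I observe.
assert 249 >= vocab_size
```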
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
