How to map words from feature importances using RandomForest and Word2Vec in PySpark

I know how to extract each word from the featureImportances array of a trained model when using TF-IDF, but I'm not sure how to do the same with a Word2Vec model.

I'm mapping the words to their importance scores with the approach below:

  • First, get the ML attributes from the metadata of the dataset. This gives a dictionary of lists (keyed by attribute type, numeric or binary) whose entries contain a name and an index value.

     attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
     list_extract = []
     for attr_type in attrs:  # e.g. "numeric", "binary"
         list_extract = list_extract + attrs[attr_type]
    

    (I merged both numeric and binary fields).

  • Then iterate over the whole list:

     for word_dict in list_extract:
         wordName = word_dict['name']
         wordIndex = word_dict['idx']
    

The word_dict['name'] value gives me the name in a format like 'inputCol_value'. I need to understand how to map this value to a word in the vocab.
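The two steps above can be sketched in plain Python without a Spark session. The `metadata` dict below is a hypothetical stand-in for `dataset.schema[featuresCol].metadata`; the column name `tf_col` and all values are made up, and `extract_attrs` is just an illustrative helper name.

```python
# Hypothetical stand-in for dataset.schema[featuresCol].metadata.
# The names and indices are invented for illustration only.
metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [
                {"idx": 0, "name": "tf_col_1024"},
                {"idx": 2, "name": "tf_col_77"},
            ],
            "binary": [{"idx": 1, "name": "tf_col_5"}],
        },
        "num_attrs": 3,
    }
}


def extract_attrs(metadata):
    """Merge every attribute group into one list ordered by feature index."""
    merged = []
    for group in metadata["ml_attr"]["attrs"].values():
        merged.extend(group)
    return sorted(merged, key=lambda attr: attr["idx"])


list_extract = extract_attrs(metadata)
pairs = [(attr["idx"], attr["name"]) for attr in list_extract]
print(pairs)
```

Sorting by `idx` is optional; it only makes the merged list line up with the order of the featureImportances array.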

I can use this method with the TF-IDF model because the name part contains the hashed value of the word; I tested it and it matches. For example, let's say I have a vocab in a format like {'hashed_value1': word_name1, ...}

Then I can get the importances of each word like this:

     wordName = wordName.replace(inputCol + '_', '')  # extract the value described above
     name = vocab[int(wordName)]  # this gives me the name of the word
     score = featureImp[wordIndex]  # this gives me the score of the corresponding word
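Put together, the TF-IDF mapping looks roughly like the sketch below. Everything in it is illustrative: `input_col`, the `vocab` contents, the importance values, and the helper name `word_importances` are assumptions, not values from my real pipeline.

```python
# All values below are hypothetical and only illustrate the mapping.
input_col = "tf_col"
vocab = {1024: "spark", 5: "forest", 77: "vector"}  # hashed value -> word
feature_imp = [0.5, 0.2, 0.3]  # stands in for model.featureImportances
list_extract = [
    {"idx": 0, "name": "tf_col_1024"},
    {"idx": 1, "name": "tf_col_5"},
    {"idx": 2, "name": "tf_col_77"},
]


def word_importances(attrs, vocab, importances, input_col):
    """Map each attribute name of the form '<input_col>_<hash>' to its score."""
    prefix = input_col + "_"
    result = {}
    for attr in attrs:
        name = attr["name"]
        if not name.startswith(prefix):
            continue  # not produced from this input column
        hashed = int(name[len(prefix):])
        word = vocab.get(hashed)
        if word is not None:
            result[word] = importances[attr["idx"]]
    return result


scores = word_importances(list_extract, vocab, feature_imp, input_col)
print(scores)
```

Slicing off the prefix instead of calling `replace` avoids touching the rest of the name if the column name happens to appear in it more than once.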

My question is: how can I map words with a Word2Vec model? Its output is also in numeric format, but I can't do the same as I did with the TF-IDF model.

I'm creating the Word2Vec model like this:

Word2Vec(inputCol = 'word_col', outputCol = 'vec_col', vectorSize=300, minCount=1)

With the same algorithm I used for the TF-IDF model, it gives me numeric values in wordName. Let's say I have a vocab with 30 words; if I set vectorSize=300, it gives me a value smaller than 300 for each word.

How can I map the words in this scenario? For example, it gave me 249 as wordName, but my vocab size is 30, so it can't be an index into the vocab or the name.

I tried setting vectorSize to the same value as the vocab size. It works, but I can't be sure whether I'm mapping correctly.
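To make the mismatch concrete, here is a toy sketch in plain Python (not Spark). It assumes, as the Spark documentation describes, that a Word2VecModel turns each row into the average of its word vectors; the words, the vectors, and the helper name `document_vector` are invented, and every word is assumed to be in the vocab. Under that assumption the feature vector has vectorSize entries no matter how many words the vocab holds, so a value such as 249 would refer to an embedding dimension rather than a word.

```python
# Toy stand-in for a trained model: 3 words, vector_size = 4.
# (In my case this would be 30 words and vectorSize=300.)
vector_size = 4
word_vectors = {
    "spark": [1.0, 0.0, 2.0, 0.0],
    "forest": [0.0, 2.0, 0.0, 4.0],
    "vector": [3.0, 3.0, 3.0, 3.0],
}


def document_vector(words, word_vectors):
    """Average the vectors of the given words, component by component."""
    vectors = [word_vectors[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


doc = document_vector(["spark", "forest"], word_vectors)
print(len(word_vectors), len(doc), doc)
```

Each position in `doc` mixes contributions from every word in the row, which is why the position alone cannot be traced back to a single vocab entry in this toy setup.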



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
