Training a CNN model using word2vec embeddings: calling get_vector() raises KeyError: 'CALLDATASIZE' while preparing train_x

The set of word vectors is generated from this file on GitHub: https://github.com/jianwei76/SoliAudit/blob/master/va/features/op.origin.csv.xz

I converted this op.origin.csv.xz file to a .txt file using the gen_doc() function:


    import logging

    import pandas as pd
    import word2vec

    opfile = 'op.origin.csv.xz'  # downloaded and uploaded to the Google Colab folder
    binfile = 'model.bin'        # file in which the trained word2vec model is saved

    def op_name(op):
        # Strip trailing digits, e.g. 'PUSH1' -> 'PUSH'
        return op.rstrip('0123456789')

    def filter_op(op_line):
        filter_ops = [op_name(op) for op in op_line.split()]
        return ' '.join(filter_ops)

    def gen_doc(opfile, docfile):
        op = pd.read_csv(opfile, compression='xz', index_col=0)
        op.dropna(inplace=True)
        op['Opcodes'] = op['Opcodes'].apply(filter_op)
        # (the step that writes the filtered opcodes to docfile is not shown in this post)

    def get_model(opfile, binfile, size=5):
        docfile = 'op-doc.tmp.txt'
        gen_doc(opfile, docfile)
        logging.info('Training opcode word2vec...in=%s, out=%s, word-embed-size=%d' % (docfile, binfile, size))
        word2vec.word2vec(docfile, binfile, size=size, verbose=True)
        return word2vec.load(binfile)
    
    
The code snippet

    op_vecs = [ opline_to_vec(row['Opcodes'], w2v) for idx, row in data.iterrows() ]

invokes this function:

    import numpy as np

    def opline_to_vec(line, w2v):
        print('inside opline_to_vec')
        ops = line.split()
        print('line.split done')
        # One row per opcode, one column per embedding dimension
        vec = np.zeros((len(ops), w2v.vectors.shape[1]))
        print('vec allocated')
        for i, op in enumerate(ops):
            print('vec[i] value:')
            vec[i] = w2v.get_vector(op_name(op))  # <-- this line raises the KeyError
            print(vec[i])

        print('returning from opline_to_vec')
        return vec

The contents of op-doc.tmp.txt look like this:


    CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST PUSH MLOAD DUP DUP DUP MSTORE PUSH ADD SWAP POP POP PUSH MLOAD DUP SWAP SUB SWAP RETURN JUMPDEST PUSH PUSH DUP CALLDATASIZE SUB DUP ADD SWAP DUP DUP CALLDATALOAD PUSH AND SWAP PUSH ADD SWAP SWAP SWAP SWAP DUP CALLDATALOAD SWAP PUSH ADD SWAP DUP ADD DUP CALLDATALOAD SWAP PUSH ADD SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP SWAP POP POP POP PUSH JUMP JUMPDEST STOP JUMPDEST CALLVALUE DUP ISZERO PUSH JUMPI PUSH DUP REVERT JUMPDEST POP PUSH PUSH DUP
 

I have highlighted the line in opline_to_vec, vec[i] = w2v.get_vector(op_name(op)), that produces the error:

    /usr/local/lib/python3.7/dist-packages/word2vec/wordvectors.py in ix(self, word)
         36         Returns the index on `self.vocab` and `self.vectors` for `word`
         37         """
    ---> 38         return self.vocab_hash[word]
         39 
         40     def word(self, ix):

    KeyError: 'CALLDATASIZE'


It would be really great if you could please help



Solution 1:[1]

It looks like you're asking a word-vectors model for the vector of a word, 'CALLDATASIZE', that it does not know.

Where did the set of word-vectors come from? (Did you train them yourself, or import them from elsewhere? How did you load them?)

Would you expect it to have a vector for that unusual opcode-word? If so, set the surrounding steps aside and check for that word directly, then go back to the prior steps that you thought should have created that word-vector.
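A direct check could look roughly like the following. This is only a sketch: `find_missing_tokens` is a hypothetical helper, and `known_words` is assumed to be something like `w2v.vocab_hash` (the dict visible in your traceback). The tiny vocabulary and lines below are made-up stand-ins so the example runs on its own.

```python
def find_missing_tokens(lines, known_words):
    """Return the sorted tokens that occur in `lines` but not in `known_words`.

    `known_words` can be any container supporting `in`; with the model from
    the question it is assumed to be w2v.vocab_hash (see the traceback).
    """
    seen = set()
    for line in lines:
        seen.update(line.split())
    return sorted(tok for tok in seen if tok not in known_words)


# Hypothetical stand-ins for the model vocabulary and the training document.
example_vocab = {"PUSH": 0, "DUP": 1, "SWAP": 2}
example_lines = ["CALLDATASIZE SUB DUP", "PUSH SWAP DUP"]

missing = find_missing_tokens(example_lines, example_vocab)
print(missing)
```

If 'CALLDATASIZE' is reported as missing even though it appears in the training file, the gap was introduced when the model was trained or loaded, not in `opline_to_vec`.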

If it's reasonable the set doesn't have that word, and you can't fix that gap, change your code to handle that case - perhaps by ignoring the word.
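Ignoring unknown words might be sketched like this, by catching the `KeyError` and skipping the token. The assumptions here are mine: `TinyVectors` is a hypothetical stand-in that only mimics the `vocab_hash` lookup seen in the traceback, and `opline_to_vecs_skipping_unknown` is an illustrative name, not part of the word2vec package. Plain lists are used instead of NumPy arrays to keep the example dependency-free.

```python
class TinyVectors:
    """Hypothetical stand-in that mimics the lookup shown in the traceback."""

    def __init__(self, table):
        self.vocab_hash = {word: i for i, word in enumerate(table)}
        self.vectors = list(table.values())

    def get_vector(self, word):
        # Raises KeyError for unknown words, like the lookup in the traceback.
        return self.vectors[self.vocab_hash[word]]


def op_name(op):
    # Same normalisation as in the question: strip trailing digits.
    return op.rstrip('0123456789')


def opline_to_vecs_skipping_unknown(line, w2v):
    """Return (vectors, skipped) for one opcode line, ignoring unknown words."""
    vectors, skipped = [], []
    for op in line.split():
        name = op_name(op)
        try:
            vectors.append(w2v.get_vector(name))
        except KeyError:
            skipped.append(name)  # keep a record so the gap stays visible
    return vectors, skipped


w2v_demo = TinyVectors({"PUSH": [0.1, 0.2], "DUP": [0.3, 0.4]})
vecs, skipped = opline_to_vecs_skipping_unknown("PUSH1 CALLDATASIZE DUP2", w2v_demo)
print(vecs, skipped)
```

Skipping tokens makes the result shorter than the number of opcodes in the line, so if the downstream CNN expects one row per opcode, substituting a zero vector for unknown words may be the better fit.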

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 gojomo