Variable-length input to an Embedding layer in Keras

I have a text corpus with variable-length documents. I am trying to feed my texts to an LSTM model using the Embedding layer in Keras. My code looks something like this:

import numpy as np
from keras.layers import Embedding
from keras.models import Sequential

vocab_size = 20000
embedding_len = 50

model = Sequential()
model.add(Embedding(vocab_size, embedding_len))

I have generated a sample input using NumPy's random number generator:

akak = []
for i in range(10):
    akak.append(np.random.randint(0, 20, size=np.random.randint(1, 30)))
input_array = np.asarray(akak)
print(input_array)

Output:

array([array([16,  2,  9, 12, 18, 10, 10, 14,  3, 11,  4,  6,  8, 11,  3]),
       array([ 3,  6,  5,  8,  3, 10, 19,  9, 17]),
       array([ 1,  6, 17, 14, 14, 19, 12, 15, 14,  0, 16,  2,  1, 18, 13, 14, 17,
       14,  2, 11,  0, 19,  2,  8, 13, 10, 17, 13,  5]),
       array([ 5, 10, 18,  0,  4,  8]),
       array([ 5, 14, 19, 16, 10,  8, 13,  8, 12,  5, 19]),
       array([ 7,  4, 17,  0, 10,  8,  3,  6, 14,  4,  8,  9,  0]),
       array([ 4,  7,  7, 16,  7,  6, 16,  9,  4,  2, 11]),
       array([ 2, 16, 15, 16, 18, 11,  7,  1,  0,  5, 11, 12, 11,  8,  3,  8,  8,
       16, 19,  8]),
       array([12, 18, 19, 15, 11,  6, 16, 16,  2, 12,  0, 14, 16,  0]),
       array([12, 13, 13])], dtype=object)

When I try to get only the embeddings using model.predict(input_array), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-313059fadb55> in <module>()
----> 1 model.predict(input_array).shape

/home/biswadip/.local/lib/python3.6/site-packages/keras/engine/training.py in predict(self, x, batch_size, verbose, steps)
   1167                                             batch_size=batch_size,
   1168                                             verbose=verbose,
-> 1169                                             steps=steps)
   1170 
   1171     def train_on_batch(self, x, y,

/home/biswadip/.local/lib/python3.6/site-packages/keras/engine/training_arrays.py in predict_loop(model, f, ins, batch_size, verbose, steps)
    292                 ins_batch[i] = ins_batch[i].toarray()
    293 
--> 294             batch_outs = f(ins_batch)
    295             batch_outs = to_list(batch_outs)
    296             if batch_index == 0:

/home/biswadip/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

/home/biswadip/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2653                 array_vals.append(
   2654                     np.asarray(value,
-> 2655                                dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
   2656         if self.feed_dict:
   2657             for key in sorted(self.feed_dict.keys()):

/home/biswadip/.local/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: setting an array element with a sequence.

I know I can just pad the sequences, but won't the LSTM layer then return the hidden state for the last padded timestep? I want the hidden state at the end of the actual sequence, not the padded one: if my sequence length is 15 and the maximum sequence length is 200, I want the hidden-state vector from the 15th step, not the 200th.
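To make the concern concrete, here is a small sketch (assuming `tensorflow.keras` and an untrained model with random weights; all names are illustrative): without masking, the zero-padding keeps driving the recurrence, so the LSTM ends up in a different state than it would on the unpadded sequence.

```python
import numpy as np
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.models import Sequential

model = Sequential([
    Embedding(20, 8),  # no mask_zero: padding tokens are treated as real input
    LSTM(4),
])

seq = np.array([[16, 2, 9, 12, 18]])     # the real 5-token sequence
padded = np.pad(seq, ((0, 0), (0, 10)))  # same sequence, zero-padded to length 15

h_real = model.predict(seq)
h_padded = model.predict(padded)
print(np.allclose(h_real, h_padded))  # the padded run ends in a different state
```

The zeros are embedded like any other token, so the state keeps evolving over the padded steps, which is exactly the behavior the question wants to avoid.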



Solution 1:[1]

The Keras documentation describes a very useful parameter of the Embedding layer for exactly this case:

mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

This means we should pad the sequences anyway, but set mask_zero=True so that downstream recurrent layers treat them as variable-length: the LSTM stops updating its state at the last real token, so the hidden state it returns comes from the actual end of the sequence, not from the padded positions.
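A minimal sketch of this approach (assuming `tensorflow.keras`; the data mirrors the randomly generated input from the question): token ids are shifted by one so that 0 is free to act as the padding value, and with mask_zero=True the LSTM returns the state at the last real timestep of each sequence.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.models import Sequential

vocab_size = 20000
embedding_len = 50

# Shift ids by 1 so index 0 is reserved for padding (hence input_dim = vocab + 1),
# then pad everything to a rectangular (10, max_len) array.
seqs = [np.random.randint(0, 20, size=np.random.randint(1, 30)) + 1
        for _ in range(10)]
padded = pad_sequences(seqs, padding="post")

model = Sequential([
    Embedding(vocab_size + 1, embedding_len, mask_zero=True),
    LSTM(32),  # with masking, returns the state at the last *real* timestep
])

out = model.predict(padded)
print(out.shape)  # (10, 32)
```

Because of the mask, predicting on the unpadded prefix of a row yields (up to float tolerance) the same vector as the padded row, which is precisely the "state 15, not state 200" behavior the question asks for.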

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 raquelhortab