Variable-length input to the Keras Embedding layer
I have a text corpus with variable-length documents. I am trying to feed my texts to an LSTM model using the Embedding layer in Keras. My code looks something like this:
import numpy as np
from keras.layers import Embedding, LSTM
from keras.models import Sequential

vocab_size = 20000
embedding_len = 50

# Embedding-only model, used below to look up the embeddings directly.
model = Sequential()
model.add(Embedding(vocab_size, embedding_len))
I have generated a sample input using the NumPy random number generator:
akak = []
for i in range(10):
    akak.append(np.random.randint(0, 20, size=np.random.randint(1, 30)))
input_array = np.asarray(akak)
print(input_array)
Output:
array([array([16, 2, 9, 12, 18, 10, 10, 14, 3, 11, 4, 6, 8, 11, 3]),
array([ 3, 6, 5, 8, 3, 10, 19, 9, 17]),
array([ 1, 6, 17, 14, 14, 19, 12, 15, 14, 0, 16, 2, 1, 18, 13, 14, 17,
14, 2, 11, 0, 19, 2, 8, 13, 10, 17, 13, 5]),
array([ 5, 10, 18, 0, 4, 8]),
array([ 5, 14, 19, 16, 10, 8, 13, 8, 12, 5, 19]),
array([ 7, 4, 17, 0, 10, 8, 3, 6, 14, 4, 8, 9, 0]),
array([ 4, 7, 7, 16, 7, 6, 16, 9, 4, 2, 11]),
array([ 2, 16, 15, 16, 18, 11, 7, 1, 0, 5, 11, 12, 11, 8, 3, 8, 8,
16, 19, 8]),
array([12, 18, 19, 15, 11, 6, 16, 16, 2, 12, 0, 14, 16, 0]),
array([12, 13, 13])], dtype=object)
When I try to get only the embeddings using model.predict(input_array), I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-313059fadb55> in <module>()
----> 1 model.predict(input_array).shape
/home/biswadip/.local/lib/python3.6/site-packages/keras/engine/training.py in predict(self, x, batch_size, verbose, steps)
1167 batch_size=batch_size,
1168 verbose=verbose,
-> 1169 steps=steps)
1170
1171 def train_on_batch(self, x, y,
/home/biswadip/.local/lib/python3.6/site-packages/keras/engine/training_arrays.py in predict_loop(model, f, ins, batch_size, verbose, steps)
292 ins_batch[i] = ins_batch[i].toarray()
293
--> 294 batch_outs = f(ins_batch)
295 batch_outs = to_list(batch_outs)
296 if batch_index == 0:
/home/biswadip/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):
/home/biswadip/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
2653 array_vals.append(
2654 np.asarray(value,
-> 2655 dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
2656 if self.feed_dict:
2657 for key in sorted(self.feed_dict.keys()):
/home/biswadip/.local/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
I know I can just pad the sequences, but won't the LSTM layer then return the hidden state from the last padded timestep? I want the hidden state from the end of the actual sequence, not the padded one, i.e. if my sequence length is 15 and the maximum sequence length is 200, I want the hidden state vector from the 15th step, not the 200th.
Solution 1:[1]
The Keras documentation describes a very useful parameter of the Embedding layer:
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
This means we should still pad the sequences, but set this parameter to True so the RNN treats them as variable-length sequences: masked (padded) timesteps are skipped by the recurrent layer, so the final hidden state it returns is the one from the last real timestep. A sketch of this approach follows.
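A minimal sketch of that approach, reusing akak, vocab_size, and embedding_len from the question; the 32 LSTM units and the max_len of 200 are arbitrary values chosen for illustration:

import numpy as np
from keras.layers import Embedding, LSTM
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences

vocab_size = 20000
embedding_len = 50
max_len = 200

# Shift every token id by 1 so that 0 is free to act as the padding value,
# then right-pad all sequences to a common length.
sequences = [seq + 1 for seq in akak]
padded = pad_sequences(sequences, maxlen=max_len, padding='post')

model = Sequential()
# input_dim is vocab_size + 1 because index 0 is now reserved for padding.
model.add(Embedding(vocab_size + 1, embedding_len, mask_zero=True))
# LSTM supports masking: masked timesteps are skipped, so the returned
# vector is the hidden state computed at the last real timestep.
model.add(LSTM(32))

print(model.predict(padded).shape)  # (10, 32)

With this setup, a sequence of true length 15 padded to 200 yields the hidden state from step 15, not step 200, which is exactly what the question asks for.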
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | raquelhortab |
