When using padding in sequence models, is Keras validation accuracy reliable?

I have a group of non-zero sequences of different lengths, and I am modeling them with a Keras LSTM. I use the Keras Tokenizer to tokenize (token indices start from 1). To make all sequences the same length, I use padding.

An example of padding:

# [0,0,0,0,0,10,3]
# [0,0,0,0,10,3,4]
# [0,0,0,10,3,4,5]
# [10,3,4,5,6,9,8]
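A minimal sketch of how such pre-padding can be produced in plain Python (Keras' `pad_sequences` does the same thing with `padding='pre'`, its default; the helper name here is just illustrative):

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen,
    truncating from the front if a sequence is too long
    (mirrors Keras pad_sequences with padding='pre')."""
    return [[value] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

sequences = [[10, 3], [10, 3, 4], [10, 3, 4, 5], [10, 3, 4, 5, 6, 9, 8]]
for row in pad_pre(sequences, maxlen=7):
    print(row)
# [0, 0, 0, 0, 0, 10, 3]
# [0, 0, 0, 0, 10, 3, 4]
# [0, 0, 0, 10, 3, 4, 5]
# [10, 3, 4, 5, 6, 9, 8]
```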

To evaluate whether the model generalizes, I hold out a validation set with a 70/30 split. At the end of each epoch, Keras reports the training and validation accuracy.

My big doubt is whether the Keras validation accuracy is reliable when padding is used. Because of padding, the data contains many runs of 0's (e.g. [0,0,0]), so the model can easily learn to predict the 0's correctly and thereby produce an artificially high validation accuracy. In other words, the model may learn the sequences of zeros rather than the real sequences.

So, does padding influence the validation accuracy in Keras?



Solution 1:[1]

No, padding does not affect the accuracy the way you are thinking. When you pad, you pad every sequence up to the length of the longest one, so even in a padded batch every timestep still produces some activation. The network will not learn to predict only sequences of 0's, because you never provide such sequences as targets. Moreover, networks learn to predict the non-zero values properly, and predicting the zeros is picked up along the way while learning to predict the other values. I hope this quells your doubts.

Solution 2:[2]

I know this answer is too late but I think it can be useful for other readers.

The short answer is YES! The padding influences the accuracy.

To handle the bad effect of padding, you can define a new metric that ignores the class corresponding to padding.

The article linked below presents a BiLSTM model for POS tagging as a sequence-tagging task, together with a special accuracy metric that ignores one class (the padding class):

from keras import backend as K

def ignore_class_accuracy(to_ignore=0):
    def ignore_accuracy(y_true, y_pred):
        # Collapse one-hot labels / predicted probabilities to class indices.
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)

        # Mask out timesteps where the predicted class is the padding class.
        ignore_mask = K.cast(K.not_equal(y_pred_class, to_ignore), 'int32')
        matches = K.cast(K.equal(y_true_class, y_pred_class), 'int32') * ignore_mask
        # Accuracy over the non-ignored timesteps only; the maximum() guards
        # against division by zero when every timestep is masked.
        accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1)
        return accuracy
    return ignore_accuracy

Note that one-hot labels are used in this case. Finally, pass the new metric when compiling the model:

from keras.optimizers import Adam

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.001),
              metrics=['accuracy', ignore_class_accuracy(0)])

During training, output like this will be reported (91% for the standard accuracy and 81% for the new, padding-aware accuracy):

Epoch 1/10 1679/2054 [=======================>......] - ETA: 2:33 -
loss: 0.2901 - accuracy: 0.9147 - ignore_accuracy: 0.8118
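As a sanity check, the same masking logic can be re-implemented in plain NumPy and run on a toy batch (the data here is illustrative). It shows concretely how padding inflates the plain accuracy and how ignoring the padding class corrects for it:

```python
import numpy as np

def ignore_class_accuracy_np(y_true, y_pred, to_ignore=0):
    """NumPy equivalent of the Keras metric above."""
    y_true_class = y_true.argmax(axis=-1)
    y_pred_class = y_pred.argmax(axis=-1)
    mask = (y_pred_class != to_ignore).astype(int)               # drop padding predictions
    matches = (y_true_class == y_pred_class).astype(int) * mask  # correct non-padding steps
    return matches.sum() / max(mask.sum(), 1)

# Toy batch: 1 sequence, 4 timesteps, 3 classes (class 0 = padding).
# True classes:      [0, 0, 1, 2]  (two padding steps, then 1 and 2)
# Predicted classes: [0, 0, 1, 1]
y_true = np.array([[[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]], dtype=float)
y_pred = np.array([[[.9, .05, .05], [.9, .05, .05], [.1, .8, .1], [.2, .6, .2]]])

plain = (y_true.argmax(-1) == y_pred.argmax(-1)).mean()
masked = ignore_class_accuracy_np(y_true, y_pred)
print(plain)   # 0.75 -- padding steps counted as "correct" predictions
print(masked)  # 0.5  -- only the 2 real timesteps are scored, 1 is correct
```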

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution | Source
Solution 1 | Abhishek Verma
Solution 2 | Alireza Mazochi