TFBertForPreTraining prediction logits in Masked LM

I am trying to get a grasp on pre-training BERT models. I am using the TFBertForPreTraining model.

According to the original paper, BERT is pre-trained on next sentence prediction and masked language modeling simultaneously. Hence I pick pairs of sequences where 50% of them follow one another and 50% do not. Additionally, I mask 15% of the input tokens and remember the masked tokens as labels. When passing the inputs through the model I get prediction_logits and seq_relationship_logits as outputs. The prediction_logits are of shape (batch_size, max_length, vocab_size). As far as I understand, each slice (i, j, :) gives the distribution over the vocabulary for the j-th token of the i-th input sequence.
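For context, this is a minimal sketch of what those two outputs look like (assuming bert-base-uncased and a single toy sentence pair rather than my actual data pipeline); note that the logits only become probabilities after a softmax:

    import tensorflow as tf
    from transformers import BertTokenizer, TFBertForPreTraining

    # Toy example: one sentence pair with a single [MASK] token
    # (bert-base-uncased and max_length=32 are just placeholders here)
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = TFBertForPreTraining.from_pretrained("bert-base-uncased")

    inputs = tokenizer(
        "The capital of France is [MASK].",
        "It lies on the river Seine.",
        return_tensors="tf",
        padding="max_length",
        max_length=32,
    )

    outputs = model(inputs)
    print(outputs.prediction_logits.shape)        # (1, 32, vocab_size): one score vector per position
    print(outputs.seq_relationship_logits.shape)  # (1, 2): is-next vs. not-next

    # Softmax turns the logits at each position into a distribution over the vocabulary
    probs = tf.nn.softmax(outputs.prediction_logits, axis=-1)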

Now the original paper states that the loss is

"the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood"

Finally, my question is whether the mean for the masked LM loss is taken over the whole sequence, only over the masked tokens, or perhaps over the whole sequence excluding the padding tokens.
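For concreteness, here is a minimal sketch of the "masked tokens only" variant, assuming the common convention of setting the label to -100 at every unmasked and padding position; masked_lm_loss, prediction_logits and mlm_labels are placeholder names I chose, not part of the TFBertForPreTraining API:

    import tensorflow as tf

    def masked_lm_loss(prediction_logits, mlm_labels):
        """Mean cross-entropy over the masked positions only.

        mlm_labels has shape (batch_size, max_length) and holds the original
        token id at each masked position and -100 everywhere else
        (unmasked tokens and padding).
        """
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
        )
        mask = tf.not_equal(mlm_labels, -100)                     # True only at masked positions
        active_logits = tf.boolean_mask(prediction_logits, mask)  # (num_masked, vocab_size)
        active_labels = tf.boolean_mask(mlm_labels, mask)         # (num_masked,)
        per_token_loss = loss_fn(active_labels, active_logits)
        return tf.reduce_mean(per_token_loss)                     # mean over masked tokens only

In this sketch the mean is taken over the masked tokens only; padding and unmasked tokens never enter the loss because they are filtered out by the boolean mask.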


