Does MLM loss include the loss of non-masked tokens too?

In BERT, I understand what the Masked Language Model (MLM) pretraining task does, but how exactly is the loss for this task calculated?

It is obvious that the loss (e.g. cross-entropy loss) for the masked tokens is included in the final loss.

But what about the tokens that aren't masked? Is a loss computed for these tokens and included in the final loss as well?
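For concreteness, a common convention (used, for example, by PyTorch's `nn.CrossEntropyLoss` via its `ignore_index` argument, and by Hugging Face's masked-LM heads, which set the labels of non-masked positions to `-100`) is to exclude non-masked tokens from the loss entirely. The sketch below is a minimal pure-Python illustration of that convention, not BERT's actual implementation; the function name and the toy logits are made up for the example:

```python
import math

def mlm_loss(logits, labels, ignore_index=-100):
    """Cross-entropy averaged only over positions whose label != ignore_index.

    Non-masked tokens are given the label `ignore_index`, so they
    contribute nothing to the loss (mirroring PyTorch's `ignore_index`).
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # non-masked token: skipped entirely
        # negative log-softmax of the target class, computed stably
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0

# Toy sequence of 4 tokens over a 3-word vocabulary; only position 1 was masked.
logits = [[2.0, 0.1, 0.3],
          [0.2, 1.5, 0.1],
          [1.0, 1.0, 1.0],
          [0.0, 3.0, 0.5]]
labels = [-100, 1, -100, -100]  # -100 marks tokens excluded from the loss
print(round(mlm_loss(logits, labels), 4))  # → 0.4181
```

Note that only the row for position 1 affects the result: changing the logits at any `-100` position leaves the loss unchanged.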



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
