Sequence Labelling with BERT

I am using a model consisting of an embedding layer and an LSTM to perform sequence labelling, in PyTorch + torchtext. I have already tokenised the sentences.
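For context, a minimal sketch of the kind of tagger I mean (the class name and layer sizes are placeholders, not my actual configuration):

```python
from torch import nn

class LSTMTagger(nn.Module):
    """Embedding + LSTM sequence labeller: one tag score per input token."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):                # (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embedded)        # (batch, seq_len, hidden_dim)
        return self.classifier(lstm_out)         # (batch, seq_len, num_tags)
```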

If I use self-trained or other pre-trained word embedding vectors, this is straightforward: the embedding layer produces exactly one vector per input token, so the output length matches the label sequence.

But if I use the Hugging Face transformers BertTokenizer.from_pretrained and BertModel.from_pretrained, a '[CLS]' token is added to the beginning of the sentence and a '[SEP]' token to the end. So the output of the model becomes a sequence that is two elements longer than the label/target sequence.
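To illustrate the mismatch (using bert-base-uncased and a toy sentence as examples; my actual checkpoint differs):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("the cat sat")        # special tokens are added by default
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]'] -> 5 ids for a 3-token, 3-label sentence
```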

What I am unsure of is:

  1. Are these two tokens needed for the BertModel to embed each token of a sentence "correctly"?
  2. If they are needed, can I take them out after the BERT embedding layer, before the input to the LSTM, so that the output lengths are correct? (See the sketch after this list.)
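
What I have in mind for question 2 is something like this sketch (assuming a transformers version whose model output exposes last_hidden_state; the LSTM hidden size of 128 is arbitrary):

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
lstm = nn.LSTM(bert.config.hidden_size, 128, batch_first=True)

encoding = tokenizer("the cat sat", return_tensors="pt")
with torch.no_grad():                             # no_grad only for this demo
    hidden = bert(**encoding).last_hidden_state   # (1, 5, 768) incl. [CLS]/[SEP]

hidden = hidden[:, 1:-1, :]                       # drop [CLS] (first) and [SEP] (last)
lstm_out, _ = lstm(hidden)                        # (1, 3, 128): matches the 3 labels
print(lstm_out.shape)
```

For padded batches this simple slice would not be enough, since '[SEP]' is no longer at the last position of every sequence; the attention mask would have to be used to locate it per sentence.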


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
