Why do I get an unexpected tokenization warning while loading the CodeBERT model?

I get the following warning while loading the BertEmbedding:

Code:

name = "microsoft/codebert-base"

from transformers import BertModel
from transformers import BertTokenizer

print("[ Using pretrained BERT embeddings ]")
self.bert_tokenizer = BertTokenizer.from_pretrained(name, do_lower_case=lower_case)
self.bert_model = BertModel.from_pretrained(name)
if fix_emb:
    print("[ Fix BERT layers ]")
    self.bert_model.eval()
    for param in self.bert_model.parameters():
        param.requires_grad = False
else:
    print("[ Finetune BERT layers ]")
    self.bert_model.train()
Output:

[ Using pretrained BERT embeddings ]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'.
The class this function is called from is 'BertTokenizer'.


Solution 1:[1]

The name codebert-base is a bit misleading, as the model is actually a RoBERTa. The architectures of BERT and RoBERTa are similar and show only minor differences, but the tokenizers are totally different (and so is the pretraining approach, but that is not relevant here).

You should load microsoft/codebert-base like this:

from transformers import RobertaModel
from transformers import RobertaTokenizer

name = "microsoft/codebert-base"
tokenizer = RobertaTokenizer.from_pretrained(name)
model = RobertaModel.from_pretrained(name)
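
As a quick sanity check, here is a minimal sketch of embedding a code snippet with the pair loaded this way (the example input string is arbitrary and PyTorch is assumed as the backend):

import torch
from transformers import RobertaModel, RobertaTokenizer

name = "microsoft/codebert-base"
tokenizer = RobertaTokenizer.from_pretrained(name)
model = RobertaModel.from_pretrained(name)
model.eval()

# Tokenize an arbitrary code snippet and compute contextual embeddings.
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)

No tokenizer-class warning is emitted here, because RobertaTokenizer matches the checkpoint.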

Or you can use the Auto Classes, which will select the proper classes for you:

from transformers import AutoTokenizer, AutoModel

name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
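
If you want to confirm what the Auto classes resolve to for this checkpoint, a small sketch (the exact class names may vary by transformers version; a fast tokenizer is typically selected by default):

from transformers import AutoModel, AutoTokenizer

name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# Both resolve to RoBERTa implementations, confirming the checkpoint type.
print(type(tokenizer).__name__)  # RobertaTokenizerFast (or RobertaTokenizer)
print(type(model).__name__)      # RobertaModel

This is also why the warning in the question appears when BertTokenizer is used instead.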

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
Solution 1: cronoik