Cannot tokenize words with BertTokenizer the way I want

I have an issue: I have a well-working, fine-tuned BERT model, and I also have validation data, i.e. texts that were split into tokens with .split().

Unfortunately, when I feed the data to my model, it returns tokens the way BertTokenizer splits them, for example ['S', '&', 'P'] instead of ['S&P'], or ['67', ',', '7'] instead of ['67,7'].
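
For reference, this is how the tokenizer splits such words on its own (a minimal reproduction; I am assuming a cased BertTokenizer from the transformers library here, and 'bert-base-cased' is just a placeholder for whatever checkpoint the model was fine-tuned from):

from transformers import BertTokenizer

demo_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # placeholder checkpoint
print(demo_tokenizer.tokenize('S&P'))   # punctuation gets split off, e.g. ['S', '&', 'P']
print(demo_tokenizer.tokenize('67,7'))  # e.g. ['67', ',', '7']
print('67,7'.split())                   # ['67,7'] -- what my validation data looks like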

Here is my code:

import numpy as np
import torch

preds = []
toks = []
for test_sentence in list(test_df.text):
    # encode() applies WordPiece and adds the [CLS] and [SEP] special tokens
    tokenized_sentence = tokenizer.encode(test_sentence)
    input_ids = torch.tensor([tokenized_sentence]).cuda()
    with torch.no_grad():
        output = model(input_ids)
    label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
    # join WordPiece sub-tokens back into words
    tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
    new_tokens, new_labels = [], []
    # skip the [CLS] and [SEP] tokens at the ends
    for token, label_idx in zip(tokens[1:-1], label_indices[0][1:-1]):
        if token.startswith("##"):
            # continuation piece: glue it onto the previous token
            new_tokens[-1] = new_tokens[-1] + token[2:]
        else:
            new_labels.append(tag_values[label_idx])
            new_tokens.append(token)
    toks.append(new_tokens)
    preds.append(new_labels)
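
Ideally the labels would line up with the original .split() tokens. Below is a rough sketch of the kind of alignment I mean, tokenizing each whitespace token separately and taking the label of its first subword; I am not sure this is the right way to do it:

word_preds = []
for test_sentence in list(test_df.text):
    words = test_sentence.split()
    # tokenize each whitespace token on its own so its subwords can be mapped back to it
    word_pieces = [tokenizer.tokenize(w) for w in words]
    flat_pieces = [p for pieces in word_pieces for p in pieces]
    input_ids = tokenizer.convert_tokens_to_ids(['[CLS]'] + flat_pieces + ['[SEP]'])
    with torch.no_grad():
        output = model(torch.tensor([input_ids]).cuda())
    label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)[0]
    labels, idx = [], 1  # start at index 1 to skip [CLS]
    for pieces in word_pieces:
        labels.append(tag_values[label_indices[idx]])  # take the label of the first subword
        idx += len(pieces)
    word_preds.append(list(zip(words, labels)))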

Can somebody help me please?


