Cannot tokenize words with BertTokenizer the way I want
I have an issue: I have a fine-tuned BERT model that works well, and I also have validation data: texts that were split into tokens with .split().
Unfortunately, when I feed the data to my model, it returns tokens the way BertTokenizer splits them. For example, ['S', '&', 'P'] instead of ['S&P'], or ['67', ',', '7'] instead of ['67,7'].
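To show what I mean, a quick check (here I assume bert-base-cased; the exact pieces depend on the vocabulary of the checkpoint you use) gives something like:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  # assumed checkpoint

text = "S&P fell 67,7 points"
print(text.split())              # ['S&P', 'fell', '67,7', 'points']  <- what my labels refer to
print(tokenizer.tokenize(text))  # e.g. ['S', '&', 'P', 'fell', '67', ',', '7', 'points']
```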
Here is my code:
import numpy as np
import torch

preds = []
toks = []
for test_sentence in list(test_df.text):
    # encode() adds [CLS]/[SEP] and splits the sentence into wordpieces
    tokenized_sentence = tokenizer.encode(test_sentence)
    input_ids = torch.tensor([tokenized_sentence]).cuda()

    with torch.no_grad():
        output = model(input_ids)
    label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)

    # join BPE-split tokens back together
    tokens = tokenizer.convert_ids_to_tokens(input_ids.to('cpu').numpy()[0])
    new_tokens, new_labels = [], []
    for token, label_idx in zip(tokens[1:-1], label_indices[0][1:-1]):  # skip [CLS]/[SEP]
        if token.startswith("##"):
            new_tokens[-1] = new_tokens[-1] + token[2:]
        else:
            new_labels.append(tag_values[label_idx])
            new_tokens.append(token)

    toks.append(new_tokens)
    preds.append(new_labels)
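I suspect I need a way to map each wordpiece prediction back to the original .split() token it came from. A rough, untested sketch of what I have in mind, assuming a fast tokenizer such as BertTokenizerFast (which exposes word_ids()) for the same checkpoint:

```python
from transformers import BertTokenizerFast
import numpy as np
import torch

fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')  # assumed checkpoint

words = "S&P fell 67,7 points".split()  # the tokens my labels refer to
encoding = fast_tokenizer(words, is_split_into_words=True, return_tensors='pt')

with torch.no_grad():
    output = model(encoding['input_ids'].cuda())
label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)[0]

# word_ids() maps every wordpiece position to the index of the original word
# ([CLS]/[SEP] map to None), so one label can be kept per whitespace token
word_labels = {}
for position, word_id in enumerate(encoding.word_ids(batch_index=0)):
    if word_id is not None and word_id not in word_labels:
        word_labels[word_id] = tag_values[label_indices[position]]  # label of first subword

new_tokens = words  # exactly what .split() produced
new_labels = [word_labels[i] for i in range(len(words))]
```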
Can somebody help me please?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
