How to specify decoder_input_ids in Torch
I just started learning NLP and am trying to vectorize a piece of text using AutoTokenizer and the pretrained 'cointegrated/rut5-small' tokenizer from Hugging Face. This is the code I use:
```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-small')
model = AutoModel.from_pretrained('cointegrated/rut5-small')

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output[0][:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()
```
It seems to work. However, when I run a test:

```python
embed_bert_cls('привет как дела', model, tokenizer)
```
I receive the following error:

```
ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds
```
I found several answers like this on Stack Overflow and GitHub saying that I need to use `inputs['input_ids']`, but I don't really understand how to apply that in my code.
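For context on what those answers are getting at: 'cointegrated/rut5-small' is a T5-style encoder-decoder model, so its full `forward()` expects decoder inputs as well as encoder inputs. For sentence embeddings you typically don't need the decoder at all, so one common workaround is to call only `model.encoder`. Here is a minimal sketch of how the function above could be adapted; the encoder-only call (rather than passing `decoder_input_ids`) is an assumption about what the asker wants, not part of the original post:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-small')
model = AutoModel.from_pretrained('cointegrated/rut5-small')

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        # Run only the encoder: the full T5 forward() also demands decoder
        # inputs, which are unnecessary when we just want an embedding.
        model_output = model.encoder(**{k: v.to(model.device) for k, v in t.items()})
    # Keep the original pooling choice: the hidden state of the first token
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()
```

Alternatively, the advice about `inputs['input_ids']` usually means feeding the same ids to the decoder, e.g. `model(**t, decoder_input_ids=t['input_ids'])`, which satisfies the error but runs the decoder needlessly; transformers also provides `T5EncoderModel` for loading the encoder half directly.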
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow