How to specify decoder_input_ids in PyTorch

I just started learning NLP and I am trying to vectorize a piece of text using AutoTokenizer and the pretrained 'cointegrated/rut5-small' model from Hugging Face. This is the code I use:

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-small') 
model = AutoModel.from_pretrained('cointegrated/rut5-small')

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output[0][:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

It seems to load fine. However, when I run a test

embed_bert_cls('привет как дела', model, tokenizer)

I receive the following error: ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds

I found several answers like this on Stack Overflow and GitHub saying that I need to use inputs['input_ids'], but I don't really understand how to apply that in my code.
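The error occurs because 'cointegrated/rut5-small' is a T5-style encoder-decoder model, so AutoModel loads a seq2seq model whose forward pass requires decoder inputs. For sentence embeddings you only need the encoder stack. One possible fix (a sketch, not the canonical answer: the function name embed_text and the mean-pooling choice are my own; T5 has no [CLS] token, so pooling the first token as in the original code is not meaningful):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-small')
model = AutoModel.from_pretrained('cointegrated/rut5-small')

def embed_text(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        # Call only the encoder stack; the encoder needs no decoder_input_ids.
        model_output = model.encoder(**{k: v.to(model.device) for k, v in t.items()})
    # Mean-pool over the token dimension (T5 has no [CLS] token to select).
    embeddings = model_output.last_hidden_state.mean(dim=1)
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

emb = embed_text('привет как дела', model, tokenizer)
```

Alternatively, Transformers provides T5EncoderModel, which loads only the encoder weights in the first place, so `T5EncoderModel.from_pretrained('cointegrated/rut5-small')` can be called directly with the tokenized inputs.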



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
