Speed up embedding of 2M sentences with RoBERTa

I have roughly 2 million sentences that I want to turn into vectors using Facebook AI's RoBERTa-large, fine-tuned on NLI and STSB for sentence similarity (using the awesome sentence-transformers package).

I already have a dataframe with two columns: "utterance", containing each sentence from the corpus, and "report", containing, for each sentence, the title of the document it comes from.

From there, my code is the following:

from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import pandas as pd

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

print("Embedding sentences")

data = pd.read_csv("data/sentences.csv")

sentences = data['utterance'].tolist()

sentence_embeddings = []

for sent in tqdm(sentences):
    embedding = model.encode([sent])
    sentence_embeddings.append(embedding[0])

data['vector'] = sentence_embeddings

Right now, tqdm estimates that the whole process will take around 160 hours on my computer, which is more than I can spare.

Is there any way I could speed this up by changing my code? Is creating a huge list in memory then appending it to the dataframe the best way to proceed here? (I suspect not).

Many thanks in advance!



Solution 1:[1]

If you want a more efficient way to encode with the same model, you can convert the sentence-transformers model to an ONNX model for ONNX Runtime, or to a TensorRT plan (engine) file. The author of sentence-transformers does not provide a way to do the conversion, though.

I found a tutorial that shows the conversion steps: quick_sentence_transformers
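
For reference, here is a minimal sketch of what the ONNX route can look like (not the tutorial's exact code): export the transformer backbone of the model to ONNX with torch.onnx.export, run it with ONNX Runtime, and apply the pooling yourself, since the exported graph only produces token embeddings. The checkpoint name, the output path "roberta_sts.onnx", and the opset version are illustrative assumptions.

# Sketch only: export the transformer backbone to ONNX and encode with
# ONNX Runtime. Assumes the checkpoint is available on the Hugging Face Hub
# under this name; "roberta_sts.onnx" is an arbitrary output path.
import numpy as np
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/roberta-large-nli-stsb-mean-tokens"
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModel.from_pretrained(model_name, return_dict=False)
backbone.eval()

# Export the backbone with dynamic batch and sequence axes.
dummy = tokenizer(["a dummy sentence"], return_tensors="pt")
torch.onnx.export(
    backbone,
    (dummy["input_ids"], dummy["attention_mask"]),
    "roberta_sts.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["token_embeddings"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "token_embeddings": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)

session = ort.InferenceSession("roberta_sts.onnx")

def encode_onnx(batch):
    enc = tokenizer(batch, padding=True, truncation=True, return_tensors="np")
    token_embeddings = session.run(
        None,
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
    )[0]
    # Mean pooling over the non-padding tokens, matching the "mean-tokens" model.
    mask = enc["attention_mask"][:, :, None].astype(np.float32)
    return (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

print(encode_onnx(["first sentence", "second sentence"]).shape)

Whatever conversion path you use, the pooling step has to match the original model's pooling (mean pooling for the *-mean-tokens checkpoints), otherwise the resulting vectors will not be comparable to those produced by SentenceTransformer.encode.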

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: yuanzz