BERT sentence embeddings from transformers

I'm trying to get sentence vectors from the hidden states of a BERT model. The Hugging Face BertModel instructions here say:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt') 
output = model(**encoded_input)

So first note, as it is on the website, this does /not/ run. You get:

>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable

But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:

encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)

That aside, the tensors I get have a different shape than I expected:

>>> output[0].shape
torch.Size([1,11,768])

This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? I have the goal of being able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.

I see that the popular bert-as-a-service project appears to use [0]

Is this correct? Is there documentation for what each of the layers is?

bert-language-model huggingface-transformers

Solution 1:^[1]

I don't think there is a single authoritative document saying what to use and when. You need to experiment and measure what works best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.

I think the rule of thumb is:

Use the last layer if you are going to fine-tune the model for your specific task. And fine-tune whenever you can; several hundred or even a few dozen training examples are enough.
Use one of the middle layers (the 7th or 8th) if you cannot fine-tune the model. The intuition is that the layers first develop an increasingly abstract and general representation of the input. At some point, the representation starts to be more targeted to the pre-training task.

Bert-as-service uses the last layer by default (but it is configurable). Here, it would be [:, -1]. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special token (the so-called [CLS] token) is considered to be the sentence embedding. This is where the [0] comes from in the snippet you refer to.
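To make the two choices above concrete (which layer, and how to turn its token vectors into one sentence vector), here is a minimal sketch on toy data. The nested lists are made-up stand-ins for real tensors, and the helper names cls_vector and mean_pool are hypothetical. In the transformers library the per-layer states become available when the model is called with output_hidden_states=True; nothing below depends on that.

```python
# Toy stand-in for the per-layer hidden states of ONE sentence:
# hidden_states[layer][token] is that token's vector at that layer.
# (BERT-base would have 13 entries, embeddings + 12 layers, of 768 dims each.)
hidden_states = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],  # embedding output
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],  # layer 1 (a "middle" layer here)
    [[2.0, 2.0], [4.0, 0.0], [0.0, 4.0]],  # layer 2 (the last layer here)
]
attention_mask = [1, 1, 0]  # 1 = real token, 0 = padding


def cls_vector(layer_tokens):
    """Return the vector of the first token, i.e. the [CLS] position."""
    return list(layer_tokens[0])


def mean_pool(layer_tokens, mask):
    """Average the token vectors, skipping padded positions."""
    kept = [vec for vec, keep in zip(layer_tokens, mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


last_layer = hidden_states[-1]
cls_embedding = cls_vector(last_layer)                          # [2.0, 2.0]
mean_embedding = mean_pool(last_layer, attention_mask)          # [3.0, 1.0]
middle_embedding = mean_pool(hidden_states[1], attention_mask)  # [2.0, 1.0]
```

Either way the result is a single vector per sentence, which is the 1xN shape the question asks for.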

Solution 2:^[2]

While the existing answer of Jindrich is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings and the short answer to this question is none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for BERT weights released by the original authors. Jacob Devlin (one of the authors of the BERT paper) wrote:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).

However, that does not mean you cannot use BERT for such a task. It just means that you cannot use the pre-trained weights out of the box. You can either train a classifier on top of BERT that learns which sentences are similar (using the [CLS] token), or you can use sentence-transformers, which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
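The point about dimensions contributing equally can be illustrated with a toy calculation. The vectors below are invented for the illustration and are not measured BERT outputs; the sketch only shows that a single large shared component can push the cosine score close to 1 even when the remaining dimensions disagree completely.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up vectors that are orthogonal to each other.
small_a = [1.0, 0.0]
small_b = [0.0, 1.0]

# The same vectors with one large shared component prepended.
big_a = [10.0] + small_a
big_b = [10.0] + small_b

without_dominant = cosine_similarity(small_a, small_b)  # 0.0
with_dominant = cosine_similarity(big_a, big_b)         # 100 / 101, about 0.99
```

If raw embeddings behave like big_a and big_b, nearly every sentence pair looks similar, which is one reason to prefer models trained specifically for similarity.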

Solution 3:^[3]

As mentioned in other answers, BERT was not meant to produce sentence-level embeddings. Now, let's work on how we can leverage the power of BERT to compute context-sensitive sentence-level embeddings.

BERT does carry context at the word level. Here is an example:

This is a wooden stick. Stick to your work.

Both sentences above contain the word 'stick', and BERT does a good job of computing a different embedding for 'stick' in each sentence (that is, per context).

Now, let's move on to another example:

--What is your age?

--How old are you?

The two sentences above are contextually very similar, so we need a model that can accept a sentence, text chunk, or paragraph and produce a single suitable embedding for the whole input. Here is how that can be achieved.

Method 1:

Use a pre-trained sentence_transformers model; here is a link to the Hugging Face hub.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim


model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")


embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")


sim_score = cos_sim(embd_a, embd_b)

print(sim_score)

output: tensor([[0.8648]])

Now, you may ask how to train your own domain-specific sentence_transformer. Here we go.

Supervised approach:

A common challenge for data scientists and ML engineers is getting correctly annotated data; it is usually hard to get it in good volume. But if you do have it, here is how to train your own sentence_transformer (don't worry, there is an unsupervised approach too).

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Each example is a sentence pair with a similarity label
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

More details here.

Tip: If you have a set of sentences that are similar to each other, say a CSV where columns A and B contain similar sentences (each row holds one pair of similar sentences), just load the CSV, assign random values between 0.85 and 0.95 as the similarity score, and proceed.
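The tip above could be sketched as follows. The column names sentence_a and sentence_b, the inline CSV text, and the helper load_similar_pairs are all assumptions made for the illustration; a real file would be opened with open(...) instead of io.StringIO.

```python
import csv
import io
import random

# Hypothetical CSV contents: each row holds two sentences assumed to be similar.
csv_text = """sentence_a,sentence_b
What is your age?,How old are you?
Where do you live?,What is your address?
"""


def load_similar_pairs(handle, low=0.85, high=0.95, seed=0):
    """Read sentence pairs and attach a random similarity score in [low, high]."""
    rng = random.Random(seed)  # fixed seed so the labels are reproducible
    pairs = []
    for row in csv.DictReader(handle):
        score = rng.uniform(low, high)
        pairs.append((row["sentence_a"], row["sentence_b"], score))
    return pairs


pairs = load_similar_pairs(io.StringIO(csv_text))
# Each tuple (a, b, score) could then be wrapped as
# InputExample(texts=[a, b], label=score) for supervised training.
```

Note that a dataset built this way contains only positive pairs, so it is worth checking the trained model on dissimilar sentences as well.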

Unsupervised approach

Say you don't have a large set of annotated data but want to train a domain-specific sentence_transformer; here is how to do it. Even unsupervised training requires data, i.e. a list of sentences or paragraphs, but it need not be annotated. If you don't have any data at all, there is still a workaround (see the last part of this answer).

Multiple approaches are available for unsupervised training; I list two of the most prominent ones. To see a list of all available approaches, please visit here.

TSDAE (link to research paper)

from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise", 
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)

# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)

model.save('output/tsdae-model')
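As a rough picture of what "adds noise on-the-fly" means in the TSDAE code above: as I understand the TSDAE paper, the noise is token deletion with a ratio of about 0.6. The sketch below is my own illustration of that idea, not the library's implementation; delete_tokens is a hypothetical helper and the whitespace tokenization is a simplification.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, seed=None):
    """Return a noisy copy of `sentence` with roughly `del_ratio` of its tokens removed."""
    rng = random.Random(seed)
    tokens = sentence.split()  # whitespace split; an assumption for brevity
    if not tokens:
        return sentence
    # Keep each token with probability 1 - del_ratio, preserving word order.
    kept = [tok for tok in tokens if rng.random() > del_ratio]
    if not kept:
        # Keep one token so the encoder never sees an empty input.
        kept = [rng.choice(tokens)]
    return " ".join(kept)


original = "You should provide at least one thousand sentences"
noisy = delete_tokens(original, seed=13)
# The training pair is (noisy, original): the encoder reads the noisy text and
# the decoder tries to reconstruct the original sentence from its embedding.
```

Because reconstruction has to go through a single sentence vector, the encoder is pushed to pack the sentence's meaning into that vector.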

SimCSE (link to research paper)

from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader

# Define your sentence transformer model (Pooling defaults to mean pooling)
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise",
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]

# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)

# Use the multiple negatives ranking loss
train_loss = losses.MultipleNegativesRankingLoss(model)

# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    show_progress_bar=True
)

model.save('output/simcse-model')
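For intuition about the SimCSE setup above: each sentence is paired with itself, the two encodings differ slightly because of dropout, and the loss asks the model to pick each sentence's own second encoding out of all the others in the batch. The following is a plain-Python sketch of such an in-batch ranking loss on made-up 2-d vectors. The function in_batch_ranking_loss is my own illustration, and the scale of 20 is an assumed value, not something taken from the code above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Every positive in the batch is a candidate; all but the i-th are negatives.
        logits = [scale * cosine(anchor, pos) for pos in positives]
        # Numerically stable log-sum-exp over the candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


# Made-up embeddings for a batch of two sentences, each encoded twice.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives sit next to the wrong anchor

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
```

The aligned batch gives a loss near zero and the swapped batch a large one. Since the negatives come from the rest of the batch, a larger batch size gives each sentence more negatives to be compared against.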

Tip: If you look carefully, the major difference is in the loss function used. To see a list of all the loss functions applicable to such training scenarios, visit here. Also, in the experiments I did, I found TSDAE more useful when you want decent precision and good recall, whereas SimCSE can be used when you want very high precision at the cost of lower recall.

Now, if you don't have sufficient data to fine-tune the model but you find a BERT model trained on your domain, you can leverage it directly by adding pooling and dense layers. Please read up on what 'pooling' is to better understand what you are doing.

from sentence_transformers import SentenceTransformer, models
from torch import nn

word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

Tip: With the above approach, if you start getting extremely high cosine scores, treat it as an alarm to do negative testing. Sometimes simply adding pooling layers does not help, so take a few examples and check the similarity scores for inputs that are not similar. It is possible that even dissimilar sentences show high similarity, and that is when you should stop, collect some data, and do unsupervised training.

For people interested in going deeper, here is a list of topics that may help:
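A negative test of the kind described in the tip can be as simple as the sketch below. The helper failed_negatives, the 0.5 threshold, and the example sentences are assumptions for illustration. word_overlap is only a stand-in scorer so the snippet runs on its own; in practice you would pass a function that encodes both sentences with your model and returns their cosine similarity.

```python
def failed_negatives(similarity, negative_pairs, threshold=0.5):
    """Return the dissimilar pairs whose similarity score is suspiciously high."""
    flagged = []
    for left, right in negative_pairs:
        score = similarity(left, right)
        if score >= threshold:
            flagged.append((left, right, score))
    return flagged


def word_overlap(left, right):
    """Stand-in scorer (Jaccard overlap of lowercase words), NOT an embedding model."""
    a, b = set(left.lower().split()), set(right.lower().split())
    return len(a & b) / len(a | b)


# Pairs that a human would judge to be unrelated.
negative_pairs = [
    ("What is your age?", "The train leaves at noon."),
    ("This is a wooden stick.", "Stick to your work."),
]

flagged = failed_negatives(word_overlap, negative_pairs)

# A collapsed model that scores everything near 1 gets every pair flagged.
flagged_collapsed = failed_negatives(lambda left, right: 0.99, negative_pairs)
```

If most of your known-dissimilar pairs end up flagged, the embeddings are not separating meaning and further training is needed.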

Pooling
Siamese Networks
Contrastive Loss

:) :)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
Solution 3	YoungSheldon
