Fine-tuning DistilBertForSequenceClassification: not learning, why is the loss not changing? Are the weights not updated?

I am relatively new to PyTorch and Huggingface-transformers and experimented with DistilBertForSequenceClassification on this Kaggle dataset.

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from transformers import get_linear_schedule_with_warmup
import torch
import torch.optim as optim
import torch.nn as nn

n_epochs = 5 # or whatever
batch_size = 32 # or whatever

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)

X_train = []
Y_train = []

for row in train_df.iterrows():
    seq = tokenizer.encode(preprocess_text(row[1]['text']),  add_special_tokens=True, pad_to_max_length=True)
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    j = 0
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], Y_train[indices]
        batch_x.cuda()
        batch_y.cuda()
        outputs = bert_distil.forward(batch_x.cuda())
        loss = criterion(outputs[0],batch_y.squeeze().cuda())
        loss.requires_grad = True
   
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  
        j+=1
        if j == 20:   
            #print(outputs[0])
            print('[%d, %5d] running loss: %.3f loss: %.3f ' %
              (epoch + 1, i*1, running_loss / 20, loss.item()))
            running_loss = 0.0
            j = 0

[1,   608] running loss: 0.689 loss: 0.687
[1,  1248] running loss: 0.693 loss: 0.694
[1,  1888] running loss: 0.693 loss: 0.683
[1,  2528] running loss: 0.689 loss: 0.701
[1,  3168] running loss: 0.690 loss: 0.684
[1,  3808] running loss: 0.689 loss: 0.688
[1,  4448] running loss: 0.689 loss: 0.692
etc...

Regardless of what I tried, the loss never decreased or even increased, nor did the predictions get any better. It seems to me that I forgot something, so that the weights are actually not updated. Does anyone have an idea?

What I tried

  • Different loss functions
    • BCE
    • CrossEntropy
    • even MSE loss
  • One-hot encoding vs. a single neuron output
  • Different learning rates and optimizers
  • I even changed all the targets to one single label, but even then the network didn't converge (a single-minibatch sanity check of this kind is sketched below).
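For reference, this is the kind of single-minibatch sanity check meant in the last bullet; a minimal sketch that assumes bert_distil, criterion, optimizer and one (batch_x, batch_y) pair from the code above are in scope. If the setup is correct, the loss should drop towards zero within a few dozen steps and the classifier weights should change between steps.

before = bert_distil.classifier.weight.detach().clone()
for step in range(50):
    optimizer.zero_grad()
    logits = bert_distil(batch_x.cuda())[0]             # forward pass, logits of shape (batch, 2)
    loss = criterion(logits, batch_y.squeeze().cuda())
    loss.backward()                                     # no loss.requires_grad = True needed; the graph is intact
    optimizer.step()
    if step % 10 == 0:
        print(step, loss.item())
after = bert_distil.classifier.weight.detach().clone()
print('classifier weights changed:', not torch.equal(before, after))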


Solution 1:[1]

Looking at the running loss and the minibatch loss is easily misleading. You should look at the epoch loss, because the inputs are the same for every epoch, so the numbers are comparable.

Besides, there are several problems in your code. After fixing all of them, the behavior is as expected: the loss slowly decreases after each epoch, and the model can also overfit a small minibatch. Please look at the code below; changes include calling model(x) instead of model.forward(x), calling cuda() only once, a smaller learning rate, etc.

Tuning and fine-tuning ML models is difficult work.

n_epochs = 5
batch_size = 1

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)

X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(row[1]['text'],  add_special_tokens=True, pad_to_max_length=True)[:100]
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()
        outputs = bert_distil(batch_x)
        loss = criterion(outputs[0], batch_y)
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  

    print('[%d] epoch loss: %.3f' %
      (epoch + 1, running_loss / len(X_train) * batch_size))
    running_loss = 0.0

Output:

[1] epoch loss: 0.695
[2] epoch loss: 0.690
[3] epoch loss: 0.687
[4] epoch loss: 0.685
[5] epoch loss: 0.684

Solution 2:[2]

I would highlight two possible reasons for your "stable" results:

  1. I agree that the learning rate is surely too high, which prevents the model from making any significant updates.
  2. But it is also important to know that, according to state-of-the-art papers, fine-tuning often has only a marginal effect on the core NLP abilities of Transformers. For example, one paper reports that fine-tuning applies only very small weight changes, citing it: "Finetuning barely affects accuracy on NEL, COREF and REL indicating that those tasks are already sufficiently covered by pre-training". Several papers suggest that fine-tuning for classification tasks is basically a waste of time. Thus, considering that DistilBert is actually a student model of BERT, maybe you won't get better results. Try pre-training on your data first; generally, pre-training has a more significant impact (a sketch of continued masked-LM pre-training follows below).
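
As a rough illustration of the pre-training suggestion in point 2, here is a minimal sketch of continued (domain-adaptive) masked-LM pre-training on the tweet texts before fine-tuning. The Trainer hyperparameters and the TweetDataset helper are illustrative assumptions, not part of the original answer.

from transformers import (DistilBertForMaskedLM, DistilBertTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
mlm_model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')

# Tokenize the raw tweet texts (train_df is the Kaggle dataframe from the question).
encodings = tokenizer(list(train_df['text']), truncation=True, max_length=100,
                      padding='max_length')

class TweetDataset(torch.utils.data.Dataset):        # hypothetical helper
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings['input_ids'])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

# The collator masks 15% of the tokens and builds the corresponding MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir='distilbert-tweets-mlm', num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=TweetDataset(encodings),
)
trainer.train()
# Afterwards, load the saved checkpoint into DistilBertForSequenceClassification
# for the actual fine-tuning step.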

Solution 3:[3]

I had a similar problem when I tried to use xxxForSequenceClassification to fine-tune my downstream task.

In the end, I changed xxxForSequenceClassification to xxxModel and added my own Dropout - FC - Softmax head. Magically that solved it; the loss decreased as expected.

I'm still trying to find out why.

Hope it may help you.

FYI, transformers version: 3.5.0
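
For illustration, a minimal sketch of the kind of custom head Solution 3 describes: the bare DistilBertModel backbone with Dropout and a linear layer on top of the [CLS] position. The class count and dropout rate here are assumptions; and since nn.CrossEntropyLoss applies softmax internally, the explicit Softmax layer from the answer is left out and the head returns raw logits.

import torch.nn as nn
from transformers import DistilBertModel

class DistilBertClassifier(nn.Module):
    def __init__(self, n_classes=2, dropout=0.3):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(self.backbone.config.dim, n_classes)   # config.dim == 768

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask)[0]  # (batch, seq, dim)
        cls = hidden[:, 0]                      # hidden state at the [CLS] position
        return self.fc(self.dropout(cls))       # raw logits, feed into CrossEntropyLoss

# Usage: logits = model(batch_x); loss = nn.CrossEntropyLoss()(logits, batch_y)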

Solution 4:[4]

Maybe the poor performance is due to gradients being applied to the whole BERT backbone, which lets the high learning rate disturb its pretrained weights. Check whether the backbone is trainable like so:

print([p.requires_grad for p in bert_distil.distilbert.parameters()])

As an alternative solution, try freezing the weights of the pretrained backbone:

for param in bert_distil.distilbert.parameters():
    param.requires_grad = False

As you are trying to optimize the weights of an already-trained model while fine-tuning on your data, you face the issues described, among other sources, in the ULMFiT paper (https://arxiv.org/abs/1801.06146). A middle ground, discriminative learning rates, is sketched below.
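
Along those lines, a small sketch of the ULMFiT-style middle ground between full fine-tuning and freezing: give the pretrained backbone a much smaller learning rate than the freshly initialized head via optimizer parameter groups. The concrete learning rates are illustrative assumptions.

import torch.optim as optim

optimizer = optim.AdamW([
    {'params': bert_distil.distilbert.parameters(),     'lr': 2e-5},   # pretrained backbone: tiny lr
    {'params': bert_distil.pre_classifier.parameters(), 'lr': 1e-4},   # newly initialized layers: larger lr
    {'params': bert_distil.classifier.parameters(),     'lr': 1e-4},
])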

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   THN
Solution 2   SvGA
Solution 3   ansvver
Solution 4   makewithplus