Pre-trained BERT with Sigmoid not training
I am using a pre-trained BERT model from the transformers library to fine-tune for binary (two-class) text classification. My last layer is a Sigmoid. In particular, this is my model:
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Sequential(
            nn.Linear(768, 50), nn.ReLU(),
            nn.Linear(50, 1), nn.Sigmoid()
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # hidden state of the [CLS] token
        last_hidden_state_cls = outputs[0][:, 0, :]
        prediction = self.classifier(last_hidden_state_cls)
        return prediction
For training, I use a batch size of 32 and the Adam optimizer with learning rate 3e-4. I use torch.nn.BCELoss() as my loss function, which, crucially, accepts probabilities and labels (not logits). However, I notice that all predicted probabilities are around 0.5 (between 0.45 and 0.55), and this never really changes over the course of training. Why could this be happening? If I remove the Sigmoid and just output raw scores (not probabilities) with torch.nn.CrossEntropyLoss(), it seems to do better. What problem does the Sigmoid face here?
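As a side note on the logits-vs-probabilities distinction described above: applying BCELoss to sigmoid outputs is mathematically equivalent to applying BCEWithLogitsLoss directly to the raw scores, but the latter fuses the sigmoid into the loss in a numerically stable way. A minimal sketch (the tensor values here are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

logits = torch.tensor([0.2, -1.5, 3.0])    # raw model scores
targets = torch.tensor([1.0, 0.0, 1.0])    # binary labels

# route 1: sigmoid first, then BCELoss on probabilities
probs = torch.sigmoid(logits)
loss_bce = nn.BCELoss()(probs, targets)

# route 2: BCEWithLogitsLoss applied directly to the raw scores
loss_bcewl = nn.BCEWithLogitsLoss()(logits, targets)

# both routes compute the same binary cross-entropy
print(loss_bce.item(), loss_bcewl.item())
```

Because of this equivalence, switching a model like the one above to output raw scores and using BCEWithLogitsLoss changes only numerical stability, not the loss being optimized.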
Could it be:
- Learning rate
- Vanishing gradients
- the ReLU
or anything else?
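To make the vanishing-gradient hypothesis above concrete: the sigmoid's derivative is sigma(x) * (1 - sigma(x)), which is at most 0.25 (at x = 0) and shrinks rapidly once the input moves into the tails, so every gradient flowing back through a sigmoid is damped. A quick check in plain Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# peaks at 0.25 for x = 0, then decays quickly as |x| grows
grads = {x: sigmoid_grad(x) for x in (0.0, 2.0, 5.0)}
print(grads)
```

Whether this is the actual culprit here is a separate question, since predictions stuck near 0.5 mean the sigmoid is operating in its non-saturated region; but the bound of 0.25 on its derivative applies everywhere.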
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow