Pre-trained BERT with Sigmoid not training

I am using a pre-trained BERT model from the transformers library to fine-tune for binary (two-class) text classification. My last layer is a Sigmoid. In particular, this is my model:

class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased')

        self.classifier = nn.Sequential(
            nn.Linear(768, 50), nn.ReLU(),
            nn.Linear(50, 1), nn.Sigmoid()
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        last_hidden_state_cls = outputs[0][:, 0, :]
        prediction = self.classifier(last_hidden_state_cls)

        return prediction

For training, I use a batch size of 32 and the Adam optimizer with learning rate 3e-4. I use torch.nn.BCELoss() as my loss function, which, crucially, accepts probabilities and labels (not logits). However, I notice that all predicted probabilities stay around 0.5 (between 0.45 and 0.55), and this never really changes over the course of training. Why could this be happening? If I remove the Sigmoid and just output raw scores (not probabilities) with torch.nn.CrossEntropyLoss(), it seems to do better. What problem does the Sigmoid face here?

Could it be:

  • Learning rate
  • Vanishing gradients
  • the ReLU

or anything else?
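To make the loss comparison concrete, here is a small numeric sketch (plain Python, separate from my model above) of one known problem with applying Sigmoid and then BCELoss as two separate steps: for confident logits the probability saturates to exactly 0.0 or 1.0 in floating point, and the log inside BCE blows up. The fused form below is the standard numerically stable identity; I am assuming this is essentially what torch.nn.BCEWithLogitsLoss computes internally.

```python
import math

def naive_bce(z, y):
    """Sigmoid first, then binary cross-entropy on the probability."""
    p = 1.0 / (1.0 + math.exp(-z))   # saturates to exactly 1.0 for large z
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """Numerically stable fused sigmoid + BCE on the raw logit."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

print(bce_with_logits(40.0, 0.0))    # finite, well-behaved (about 40.0)
try:
    naive_bce(40.0, 0.0)             # sigmoid(40) rounds to 1.0 -> log(0)
except ValueError as e:
    print("naive form failed:", e)
```

This alone would not explain probabilities pinned near 0.5, but it is why the logits-based losses are generally recommended over a separate Sigmoid layer plus BCELoss.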



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
