Huggingface transformers: training loss sometimes decreases really slowly (using Trainer)

I'm fine-tuning a sentiment analysis model on news data. Since the simplest route is to start from a Huggingface pre-trained model (roberta-base), I followed this Huggingface tutorial: https://huggingface.co/blog/sentiment-analysis-python. The custom input data is simple: there are 2 columns named 'text' and 'labels'. The 'text' column contains news sentences and the 'labels' column contains '0' (40%) and '1' (60%). The data was then split into train, eval, and test sets.

Here is the problem I ran into: 'eval_loss' never really changes during training, yet the accuracy is above 50%. The training loss decreases during training, so the model seems to have learned something. Maybe it stopped learning after the first epoch, or the best checkpoint was selected automatically, but I'm confused about what actually happened.

And this is the training code (without labeling code):

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
import numpy as np
from datasets import load_metric
from transformers import set_seed

set_seed(42)

dataset = load_dataset('json',data_files={'train':'./data/labeled_news/labeled_news_heads_train.json',
                                          'eval':'./data/labeled_news/labeled_news_heads_eval.json'}, field='data')
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_datasets["train"].shuffle(seed=42)
eval_dataset = tokenized_datasets["eval"].shuffle(seed=42)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)


def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}


from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

repo_name = "Direct_v1"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    weight_decay=0.01,
    save_strategy="steps",
    evaluation_strategy ='steps',
    eval_steps = 250,
    save_steps=250,
    push_to_hub=False,
    save_total_limit = 5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

And this is the result printed on console:

Using custom data configuration default-e08b7987c7aa36c3
Reusing dataset json (/home/nvme20142249/.cache/huggingface/datasets/json/default-e08b7987c7aa36c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
100%|██████████| 2/2 [00:00<00:00, 315.56it/s]
Loading cached processed dataset at /home/nvme20142249/.cache/huggingface/datasets/json/default-e08b7987c7aa36c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-050035fb0e59db40.arrow
Loading cached processed dataset at /home/nvme20142249/.cache/huggingface/datasets/json/default-e08b7987c7aa36c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-2981b391c69b5e0c.arrow
Loading cached shuffled indices for dataset at /home/nvme20142249/.cache/huggingface/datasets/json/default-e08b7987c7aa36c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-26ea42ee0127a8d9.arrow
Loading cached shuffled indices for dataset at /home/nvme20142249/.cache/huggingface/datasets/json/default-e08b7987c7aa36c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-ef064a1251721c99.arrow
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
/home/nvme20142249/PycharmProjects/StockPrediction/venv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 10147
  Num Epochs = 5
  Instantaneous batch size per device = 24
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 1
  Total optimization steps = 2115
 12%|█▏        | 250/2115 [02:04<15:33,  2.00it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
100%|██████████| 634/634 [00:14<00:00, 53.32it/s]
                                                 Saving model checkpoint to Direct_v1/checkpoint-250
Configuration saved in Direct_v1/checkpoint-250/config.json
{'eval_loss': 0.6686041951179504, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 14.2853, 'eval_samples_per_second': 44.381, 'eval_steps_per_second': 44.381, 'epoch': 0.59}
Model weights saved in Direct_v1/checkpoint-250/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-250/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-250/special_tokens_map.json
 24%|██▎       | 500/2115 [04:28<14:23,  1.87it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
{'loss': 0.6803, 'learning_rate': 1.5271867612293146e-05, 'epoch': 1.18}
 24%|██▎       | 500/2115 [04:43<14:23,  1.87it/s]
100%|██████████| 634/634 [00:15<00:00, 49.78it/s]
Saving model checkpoint to Direct_v1/checkpoint-500
Configuration saved in Direct_v1/checkpoint-500/config.json

{'eval_loss': 0.6686403751373291, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 15.0809, 'eval_samples_per_second': 42.04, 'eval_steps_per_second': 42.04, 'epoch': 1.18}

Model weights saved in Direct_v1/checkpoint-500/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-500/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-500/special_tokens_map.json
 35%|███▌      | 750/2115 [06:56<11:30,  1.98it/s]
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
 35%|███▌      | 750/2115 [07:10<11:30,  1.98it/s]
100%|██████████| 634/634 [00:14<00:00, 51.95it/s]
Saving model checkpoint to Direct_v1/checkpoint-750
Configuration saved in Direct_v1/checkpoint-750/config.json

{'eval_loss': 0.6685948967933655, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 14.3642, 'eval_samples_per_second': 44.138, 'eval_steps_per_second': 44.138, 'epoch': 1.77}

Model weights saved in Direct_v1/checkpoint-750/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-750/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-750/special_tokens_map.json
 47%|████▋     | 1000/2115 [09:18<09:18,  2.00it/s]
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
{'loss': 0.6786, 'learning_rate': 1.054373522458629e-05, 'epoch': 2.36}
 47%|████▋     | 1000/2115 [09:32<09:18,  2.00it/s]
100%|██████████| 634/634 [00:14<00:00, 52.47it/s]
Saving model checkpoint to Direct_v1/checkpoint-1000
Configuration saved in Direct_v1/checkpoint-1000/config.json

{'eval_loss': 0.6686900854110718, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 14.7566, 'eval_samples_per_second': 42.964, 'eval_steps_per_second': 42.964, 'epoch': 2.36}

Model weights saved in Direct_v1/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-1000/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-1000/special_tokens_map.json
 59%|█████▉    | 1250/2115 [11:40<07:14,  1.99it/s]
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
 59%|█████▉    | 1250/2115 [11:54<07:14,  1.99it/s]
100%|██████████| 634/634 [00:14<00:00, 52.63it/s]
Saving model checkpoint to Direct_v1/checkpoint-1250
Configuration saved in Direct_v1/checkpoint-1250/config.json

{'eval_loss': 0.6696870923042297, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 14.2725, 'eval_samples_per_second': 44.421, 'eval_steps_per_second': 44.421, 'epoch': 2.96}

Model weights saved in Direct_v1/checkpoint-1250/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-1250/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-1250/special_tokens_map.json
 71%|███████   | 1500/2115 [14:01<05:09,  1.99it/s]
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
{'loss': 0.6798, 'learning_rate': 5.815602836879432e-06, 'epoch': 3.55}
 71%|███████   | 1500/2115 [14:16<05:09,  1.99it/s]
100%|██████████| 634/634 [00:14<00:00, 52.17it/s]
Saving model checkpoint to Direct_v1/checkpoint-1500
Configuration saved in Direct_v1/checkpoint-1500/config.json

{'eval_loss': 0.6706184148788452, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 14.5084, 'eval_samples_per_second': 43.699, 'eval_steps_per_second': 43.699, 'epoch': 3.55}

Model weights saved in Direct_v1/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-1500/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-1500/special_tokens_map.json
Deleting older checkpoint [Direct_v1/checkpoint-250] due to args.save_total_limit
 83%|████████▎ | 1750/2115 [16:25<03:03,  1.99it/s]
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
 83%|████████▎ | 1750/2115 [16:39<03:03,  1.99it/s]
100%|██████████| 634/634 [00:14<00:00, 50.95it/s]
Saving model checkpoint to Direct_v1/checkpoint-1750
Configuration saved in Direct_v1/checkpoint-1750/config.json

{'eval_loss': 0.6691468954086304, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 14.515, 'eval_samples_per_second': 43.679, 'eval_steps_per_second': 43.679, 'epoch': 4.14}

Model weights saved in Direct_v1/checkpoint-1750/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-1750/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-1750/special_tokens_map.json
Deleting older checkpoint [Direct_v1/checkpoint-500] due to args.save_total_limit
 95%|█████████▍| 2000/2115 [18:48<00:58,  1.95it/s]
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 634
  Batch size = 1
{'loss': 0.6784, 'learning_rate': 1.087470449172577e-06, 'epoch': 4.73}
 95%|█████████▍| 2000/2115 [19:04<00:58,  1.95it/s]
100%|██████████| 634/634 [00:15<00:00, 50.16it/s]
Saving model checkpoint to Direct_v1/checkpoint-2000
Configuration saved in Direct_v1/checkpoint-2000/config.json

{'eval_loss': 0.6719586253166199, 'eval_accuracy': 0.610410094637224, 'eval_f1': 0.7580803134182175, 'eval_runtime': 15.2941, 'eval_samples_per_second': 41.454, 'eval_steps_per_second': 41.454, 'epoch': 4.73}

Model weights saved in Direct_v1/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in Direct_v1/checkpoint-2000/tokenizer_config.json
Special tokens file saved in Direct_v1/checkpoint-2000/special_tokens_map.json
Deleting older checkpoint [Direct_v1/checkpoint-750] due to args.save_total_limit
100%|██████████| 2115/2115 [20:05<00:00,  2.05it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 2115/2115 [20:05<00:00,  1.75it/s]
{'train_runtime': 1205.4397, 'train_samples_per_second': 42.088, 'train_steps_per_second': 1.755, 'train_loss': 0.6791386345035922, 'epoch': 5.0}
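
As a sanity check on the log above, the step count and the logged learning_rate values can be reproduced by hand. This is only a sketch: it assumes the Trainer defaults of a linear decay schedule with zero warmup steps and no dropped last batch (none of these are set explicitly in the code above), and the helper name linear_lr is made up for illustration.

```python
import math

# Values taken from the training log and TrainingArguments above.
num_examples = 10147
batch_size = 24
num_epochs = 5
base_lr = 2e-5

# One optimizer step per batch; the last partial batch still counts as a step
# (assumption: the last incomplete batch is not dropped).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    """Learning rate after `step` optimizer steps, assuming linear decay to zero and no warmup."""
    return base_lr * (total_steps - step) / total_steps


print(total_steps)
for step in (500, 1000, 1500, 2000):
    print(step, linear_lr(step))
```

If the assumptions hold, the printed values line up with 'Total optimization steps = 2115' and with the learning_rate entries logged at steps 500, 1000, 1500 and 2000, which suggests the scheduler itself is behaving as configured.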

I think this is quite weird, because the model seems to have learned something, yet eval_loss doesn't change during training. Does 'transformers.Trainer' select the best checkpoint automatically? I'm not sure whether this is an error or not.
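
One way to read the constant metrics, offered as a sketch and not as a diagnosis: the logged eval_accuracy and eval_f1 are what a model would score if it predicted label 1 for every evaluation example. The count of 387 positive examples used below is not stated anywhere in the post; it is inferred from eval_accuracy × 634 and is an assumption. The function name is hypothetical.

```python
import math

# From the log: 634 evaluation examples, eval_accuracy = 0.610410094637224.
n_eval = 634
n_pos = 387  # assumption: inferred from 0.610410094637224 * 634, not given in the post


def constant_positive_metrics(n_pos, n_total):
    """Accuracy and F1 for a model that always predicts label 1,
    plus the cross-entropy of predicting the class prior for every example."""
    tp = n_pos              # every positive example is predicted positive
    fp = n_total - n_pos    # every negative example is also predicted positive
    fn = 0                  # nothing is ever predicted negative
    accuracy = tp / n_total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Loss if the model outputs probability p for label 1 on every input.
    p = n_pos / n_total
    prior_loss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return accuracy, f1, prior_loss


accuracy, f1, prior_loss = constant_positive_metrics(n_pos, n_eval)
print(accuracy, f1, prior_loss)
```

If these values agree with the logged eval_accuracy, eval_f1 and eval_loss, the model may be outputting roughly the class prior for every input instead of separating the two classes, which would explain why the evaluation metrics are identical at every checkpoint.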

** Edited on 4/25: I changed the compute_metrics function to

    load_accuracy = load_metric("accuracy")
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return load_accuracy.compute(predictions=predictions, references=labels)

and the training error decreased normally during training. I thought the problem was solved, but sometimes it isn't: with the same datasets (different checkpoints), the training error did not decrease. Why did this happen?

sentiment-analysis huggingface-transformers

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
