PyTorch Lightning training console output is weird

When training a PyTorch Lightning model in a Jupyter Notebook, the console log output is awkward:

Epoch 0: 100%|█████████▉| 2315/2318 [02:05<00:00, 18.41it/s, loss=1.69, v_num=26, acc=0.562]
Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/1 [00:00<?, ?it/s]
Epoch 0: 100%|██████████| 2318/2318 [02:09<00:00, 17.84it/s, loss=1.72, v_num=26, acc=0.500, val_loss=1.570, val_acc=0.564]
Epoch 1: 100%|█████████▉| 2315/2318 [02:04<00:00, 18.63it/s, loss=1.56, v_num=26, acc=0.594, val_loss=1.570, val_acc=0.564]
Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/1 [00:00<?, ?it/s]
Epoch 1: 100%|██████████| 2318/2318 [02:08<00:00, 18.07it/s, loss=1.59, v_num=26, acc=0.528, val_loss=1.490, val_acc=0.583]
Epoch 2: 100%|█████████▉| 2315/2318 [02:01<00:00, 19.02it/s, loss=1.53, v_num=26, acc=0.617, val_loss=1.490, val_acc=0.583]
Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/1 [00:00<?, ?it/s]
Epoch 2: 100%|██████████| 2318/2318 [02:05<00:00, 18.42it/s, loss=1.57, v_num=26, acc=0.500, val_loss=1.460, val_acc=0.589]

Presumably, the "correct" output from the same training run should look like this:

Epoch 0: 100%|██████████| 2318/2318 [02:09<00:00, 17.84it/s, loss=1.72, v_num=26, acc=0.500, val_loss=1.570, val_acc=0.564]
Epoch 1: 100%|██████████| 2318/2318 [02:08<00:00, 18.07it/s, loss=1.59, v_num=26, acc=0.528, val_loss=1.490, val_acc=0.583]
Epoch 2: 100%|██████████| 2318/2318 [02:05<00:00, 18.42it/s, loss=1.57, v_num=26, acc=0.500, val_loss=1.460, val_acc=0.589]

How come the epoch lines are needlessly repeated and split up like this? I'm also not sure what the Validating lines are for, since they don't seem to carry any information.

The training and validation steps from the model are as follows:

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self.forward(x)
        # y_hat holds probabilities, so take the log before NLLLoss;
        # argmax converts the one-hot targets to class indices
        loss = torch.nn.NLLLoss()(torch.log(y_hat), y.argmax(dim=1))
        acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
        # show training accuracy in the progress bar
        self.log("acc", acc, prog_bar=True)
        return loss

    def validation_step(self, valid_batch, batch_idx):
        x, y = valid_batch
        y_hat = self.forward(x)
        loss = torch.nn.NLLLoss()(torch.log(y_hat), y.argmax(dim=1))
        acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
        # show validation metrics in the progress bar
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)


Solution 1:[1]

I've had this problem before when the terminal window is resized. The default PL progress bar is based on tqdm, and you can run into issues if tqdm doesn't redraw the screen correctly.

The PL docs mention an alternative progress bar based on rich (RichProgressBar) that you might try instead, and also explain how to write your own.
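
Swapping the progress bar is just a matter of passing the callback to the Trainer. A minimal sketch, assuming a Lightning version that ships RichProgressBar (it also requires the rich package to be installed):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import RichProgressBar

    # replace the default tqdm-based progress bar with the rich-based one
    trainer = Trainer(callbacks=[RichProgressBar()], max_epochs=3)
    # trainer.fit(model, train_loader, valid_loader)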

Solution 2:[2]

By default, the Trainer runs the validation loop after every training epoch. You can change this with the check_val_every_n_epoch flag on Trainer; see the Trainer docs.
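
For example, a minimal sketch (model and dataloaders are assumed to be your own) that runs validation only once every 5 epochs:

    from pytorch_lightning import Trainer

    # run the validation loop only every 5 training epochs
    trainer = Trainer(check_val_every_n_epoch=5, max_epochs=15)
    # trainer.fit(model, train_loader, valid_loader)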

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Will Brannon
Solution 2: Harsh