PyTorch loss is NaN

I'm trying to write my first neural network with PyTorch. Unfortunately, I run into a problem when I try to compute the loss. I get the following error message:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
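As far as I understand, this error is raised during loss.backward() when anomaly detection is enabled:

torch.autograd.set_detect_anomaly(True)  # makes autograd report which op produced the NaN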

So I tried debugging and found something strange. The input contains no NaNs or infs, which I verify with the following:

print(torch.any(torch.isnan(inputs)))
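(Note: the check above only tests for NaNs. A fuller sanity check, sketched below, would also cover infs and extreme magnitudes:)

print(torch.any(torch.isnan(inputs)))  # NaNs
print(torch.any(torch.isinf(inputs)))  # infs
print(inputs.abs().max())              # extreme magnitudes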

But when I print the intermediate value of x at each step inside the model, I can see that inf values appear at some point.
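An alternative to the print statements in the model below is to register forward hooks that flag the first module whose output is non-finite. A minimal sketch (assuming model is the Net instance defined below):

def report_nonfinite(name):
    def hook(module, inp, out):
        # torch.isfinite is False for both inf and NaN entries
        if not torch.isfinite(out).all():
            print(f"non-finite values first appear after {name}")
    return hook

for name, module in model.named_modules():
    if name:  # skip the root module itself
        module.register_forward_hook(report_nonfinite(name))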

training

inputs, labels = data
print(torch.any(torch.isnan(inputs)))
optimizer.zero_grad()
outputs = model(inputs)
print(outputs)
loss = criterion(outputs, labels)
print(f"epoch: {epoch + 1} loss: {loss.item()}")
loss.backward()

optimizer.step()

model

import torch
from torch.nn import Module, Conv1d, ReLU, MaxPool1d, Linear

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = Conv1d(in_channels=1, out_channels=5, kernel_size=5, stride=2, dtype=torch.float64)
        self.act1 = ReLU()
        self.pool1 = MaxPool1d(2)
        self.layer2 = Conv1d(in_channels=5, out_channels=1, kernel_size=2, dtype=torch.float64)
        self.fcl1 = Linear(1350, 16, dtype=torch.float64)

    def forward(self, x):
        print("raw", x)
        x = self.layer1(x)
        print("conv1d 1", x)
        x = self.act1(x)
        print("relu", x)
        x = self.layer2(x)
        print("conv1d 2", x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        x = self.pool1(x)
        print("pools", x)
        x = self.fcl1(x)
        print("linear", x)
        return x

output

tensor(False)
raw tensor([[9.0616e+227, 2.4353e-152,  1.0294e-71,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]], dtype=torch.float64)
conv1d 1 tensor([[   -inf,    -inf,    -inf,  ..., -0.2516, -0.2516, -0.2516],
        [    inf,     inf,     inf,  ...,  0.3377,  0.3377,  0.3377],
        [   -inf,    -inf,    -inf,  ...,  0.4285,  0.4285,  0.4285],
        [   -inf,    -inf,    -inf,  ..., -0.1230, -0.1230, -0.1230],
        [    inf,     inf,     inf,  ...,  0.3793,  0.3793,  0.3793]],
       dtype=torch.float64, grad_fn=<SqueezeBackward1>)
relu tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [   inf,    inf,    inf,  ..., 0.3377, 0.3377, 0.3377],
        [0.0000, 0.0000, 0.0000,  ..., 0.4285, 0.4285, 0.4285],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [   inf,    inf,    inf,  ..., 0.3793, 0.3793, 0.3793]],
       dtype=torch.float64, grad_fn=<ReluBackward0>)
conv1d 2 tensor([[        -inf,         -inf,         -inf,  ..., -5.4167e+265,
         -5.4167e+265, -5.4167e+265]], dtype=torch.float64,
       grad_fn=<SqueezeBackward1>)
pools tensor([[        -inf, -5.4167e+265, -5.4167e+265,  ..., -5.4167e+265,
         -5.4167e+265, -5.4167e+265]], dtype=torch.float64,
       grad_fn=<SqueezeBackward1>)
linear tensor([[inf, inf, -inf, -inf, -inf, inf, inf, inf, inf, inf, inf, -inf, inf, inf, -inf, -inf]],
       dtype=torch.float64, grad_fn=<AddmmBackward0>)
tensor([[inf, inf, -inf, -inf, -inf, inf, inf, inf, inf, inf, inf, -inf, inf, inf, -inf, -inf]],
       dtype=torch.float64, grad_fn=<AddmmBackward0>)
epoch: 1 loss: nan

Thanks for your help!



Solution 1:[1]

Sorry, my reputation is not high enough to comment directly. This may be caused by exploding gradients due to an excessively high learning rate. I recommend reducing the learning rate or using weight_decay.
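For example (a sketch; the values are illustrative, not tuned for this model):

import torch
import torch.optim as optim

# lower learning rate plus L2 regularization via weight_decay
optimizer = optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-5)

# gradient clipping (a further option beyond the suggestion above) goes
# between loss.backward() and optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)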

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: ki-ljl