Training loss much lower than test loss even though it's using the same data

I'm using the same data for training and testing (which I know isn't best practice), so in theory the loss should be essentially the same in both cases. However, during training my loss is usually around 1e-07, while during testing it is around 0.1.

Ideally, the loss should be very similar between training and testing, but instead the two values are orders of magnitude apart.

I'm loading data like this:

import os

import PIL.Image
import torch
import torch.utils.data
from torchvision import transforms

# `device` and `data` (the list of labels) are globals defined elsewhere in the script
class osuDataSet(torch.utils.data.Dataset):
    def __init__(self):
        self.indexes = []
        self.imgs = []
        for img in os.listdir("data/imgs/"):
            index = img.split(".")[0]
            # remove all non-numeric characters
            index = "".join(filter(str.isdigit, index))
            self.indexes.append(int(index))
            # self.imgs.append(PIL.Image.open("data/imgs/" + img))
            self.imgs.append("data/imgs/" + img)
        
    def __len__(self):
        return len(self.indexes) if self.indexes is not None else 0
    
    def __getitem__(self, index):
        img = PIL.Image.open(self.imgs[index])
        img = transforms.functional.pil_to_tensor(img).to(device)
        img = img.type(torch.FloatTensor)  # .type(torch.FloatTensor) returns a CPU tensor,
        img = img.to(device)               # so it has to be moved back to the device
        return (torch.tensor(data[index]), img)

Here's my training loop:

    losses = []
    try:
        for epoch in range(args.epochs):
            for (index, img) in trainDataloader:
                for i, image in enumerate(img):
                    # circlenet.zero_grad()
                    optimizer.zero_grad()
                    output = circlenet(image)
                    # print(index[i])
                    loss = nn.functional.mse_loss(
                        output,
                        torch.tensor(index[i], device=device, dtype=torch.float),
                    )
                    loss = loss.to(device)
                    loss.backward()
                    optimizer.step()

            if epoch % 50 == 0:
                if args.cuda:
                    GPUs = GPUtil.getGPUs()
                    print(GPUs[0].temperature, "C")
            
            if epoch % args.saveevery == 0:
                circlenet.cpu()
                torch.save({"model": circlenet.state_dict(), "optimizer": optimizer.state_dict()}, f"{args.save_dir}/weights.pth")
                circlenet.to(device)

            losses.append(loss.item())
            print(f"Epoch: {epoch + 1: <6} Loss: {loss.item()}") 
    except KeyboardInterrupt:
        torch.save({"model": circlenet.state_dict(), "optimizer": optimizer.state_dict()}, f"{args.save_dir}/weights.pth")
    import matplotlib.pyplot as plt
    plt.plot(losses)
    plt.show()

Here's how I'm testing:

    rand = random.randint(0, len(os.listdir("data/imgs/")) - 1)
    import cv2
    # use the network
    circlenet.eval()
    img = (PIL.Image.open(f"data/imgs/img{rand}.jpg"))
    img = transforms.functional.pil_to_tensor(img).to(device)
    img = img.type(torch.FloatTensor)
    img = img.to(device)
    with torch.no_grad():
        out = circlenet(img)
    out = out.cpu().numpy()
    out = out.tolist()
    imgcv = cv2.imread(f"data/imgs/img{rand}.jpg")
    print("Output: ", out)
    print(rand)
    ans = data[rand - 1]
    print("Answer: ", ans)
    loss = nn.functional.mse_loss(torch.tensor(out, dtype=torch.float, device=device), torch.tensor(ans, dtype=torch.float, device=device))
    print("Loss: ", loss.item())
    cv2.circle(imgcv, (round(ans[1] * 256), round(ans[2] * 144)), 2, (255, 255, 0), 2) # answer
    color = (0, 255, 0) if round(out[0]) == 1 else (0, 0, 255)
    cv2.circle(imgcv, (round(out[1] * 256), round(out[2] * 144)), 4, color, 2)
    imgcv = cv2.resize(imgcv, (480, 270))
    cv2.imshow("output", imgcv)
    cv2.waitKey(0)

Some outputs for training:

Epoch: 94     Loss: 7.115558560144564e-07
Epoch: 95     Loss: 5.9022491768701e-05
Epoch: 96     Loss: 2.5865596398944035e-05
Epoch: 97     Loss: 9.173281227958796e-07
Epoch: 98     Loss: 8.050536962400656e-06
Epoch: 99     Loss: 8.39896165416576e-06
Epoch: 100    Loss: 7.107677788553701e-07

Output for testing:

You are running on device: NVIDIA GeForce RTX 3050 Ti Laptop GPU
Current statistics:
| ID | GPU | MEM |
------------------
|  0 | 40% | 12% |
55.0 C
Output:  [0.9986587166786194, 0.6712906360626221, 0.6456944346427917]
870
Answer:  [1.0, 0.3328125, 0.8268518518518518]
Loss:  0.04912909120321274


Solution 1 [1]:

Sorry, I cannot write a comment due to my low reputation. Although I cannot directly answer your question, there are two main points you should consider in your code:

  1. The loss values in the two snippets are not computed from the same inputs. In the training code, the printed loss comes from the last image of the last batch in an epoch, while in the test code the input image is chosen at random. These two images are almost never the same, so the two numbers are not directly comparable (see the first sketch after this list).
  2. In the training phase, you load data through the Dataset/DataLoader pipeline with its own pre-processing in __getitem__, while at test time you build the input tensor by hand and look the label up separately. Any mismatch between the two paths shows up as extra test loss (see the second sketch after this list).
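
For the first point, make the two numbers comparable by computing the loss over the same inputs in both phases, for example over the whole dataset. Here is a minimal sketch, assuming the names from the question (circlenet, trainDataloader, device) and that the DataLoader yields (labels, images) batches exactly as in your training loop; the helper name mean_dataset_loss is made up for illustration:

    import torch
    import torch.nn as nn

    def mean_dataset_loss(model, dataloader, device):
        """Average MSE over every sample, not just the last image of the last batch."""
        model.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for labels, imgs in dataloader:          # same pipeline as training
                for i, image in enumerate(imgs):
                    output = model(image)
                    target = labels[i].to(device=device, dtype=torch.float)
                    total += nn.functional.mse_loss(output, target).item()
                    count += 1
        model.train()
        return total / max(count, 1)

    # After training, this value is directly comparable to a test-time loss:
    # print("Mean training loss:", mean_dataset_loss(circlenet, trainDataloader, device))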
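
For the second point, the least error-prone fix is to fetch the test sample through the same Dataset class instead of rebuilding the tensor by hand, so the image pre-processing and the label lookup are guaranteed to match what the network saw during training. Another minimal sketch using the question's names (the helper test_one_sample is made up for illustration); indexing through the dataset also avoids the separate data[rand - 1] / img{rand}.jpg bookkeeping in your manual test code:

    import random
    import torch
    import torch.nn as nn

    def test_one_sample(model, dataset, device):
        """Evaluate one random sample through the exact training pre-processing."""
        model.eval()
        pos = random.randint(0, len(dataset) - 1)    # position in the dataset, not a file number
        label, image = dataset[pos]                  # same __getitem__ path as training
        with torch.no_grad():
            output = model(image)
        target = label.to(device=device, dtype=torch.float)
        loss = nn.functional.mse_loss(output, target)
        print("Output:", output.tolist())
        print("Answer:", target.tolist())
        print("Loss:  ", loss.item())

    # test_one_sample(circlenet, osuDataSet(), device)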

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution source
[1] Solution 1 by cao-nv