Google Colab GPU RAM depletes quickly on test data but not on training data

I am training a neural network built with PyTorch on Google Colab Pro+ (Tesla P100-PCIE GPU), but I encounter the following strange phenomenon:

  1. The amount of free memory (Mem Free) stays nearly constant when I train the model on training data, so there is no GPU memory issue during training (even over multiple epochs, where I use torch.cuda.empty_cache() at the beginning of each epoch).

  2. However, Mem Free quickly depletes as soon as I evaluate the model on the test data after each training epoch. Everything except the input data to the network stays the same during the process.

To be more precise, the code during which Mem Free depletes looks like

self.batch_training(train=True)  # no problem
self.batch_training(train=False) # issue occurs here

Here, self is a class instance that stores the model, the data, and the training method.

The function batch_training looks like

def batch_training(self, train=True):
    if train:
        X, Y = self.X_train, self.Y_train
    else:
        X, Y = self.X_test, self.Y_test
    ...
    # makes predictions and does gradient descent if train == True;
    # otherwise, just computes customized losses
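
For reference, here is a minimal, hypothetical sketch of the kind of structure described above. The model, optimizer, loss function, and batch size are placeholders I made up for illustration, not the original code:

import torch
import torch.nn as nn

class Trainer:
    """Hypothetical stand-in for the class described in the question."""

    def __init__(self, model, X_train, Y_train, X_test, Y_test, batch_size=64):
        self.model = model
        self.optimizer = torch.optim.Adam(model.parameters())
        self.loss_fn = nn.MSELoss()
        self.X_train, self.Y_train = X_train, Y_train
        self.X_test, self.Y_test = X_test, Y_test
        self.batch_size = batch_size

    def batch_training(self, train=True):
        X, Y = (self.X_train, self.Y_train) if train else (self.X_test, self.Y_test)
        losses = []
        for xb, yb in zip(X.split(self.batch_size), Y.split(self.batch_size)):
            pred = self.model(xb)
            loss = self.loss_fn(pred, yb)
            if train:
                self.optimizer.zero_grad()
                loss.backward()      # also frees the intermediate buffers for this batch
                self.optimizer.step()
            # collect the customized losses (kept as tensors here)
            losses.append(loss)
        return losses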

Below is a screenshot that illustrates the situation (just ignore the training loss_g in each batch); I ran the code on a subset of the training data so that it moves on to the test data quickly. The issue appears as soon as the test metrics are computed.

Screenshot of the issue (sorry, I could not embed it inline due to "not enough reputation").

Thank you very much for your help!



Solution 1:[1]

For anyone who might encounter similar issues:

As explained in the PyTorch FAQ:

Don’t accumulate history across your training loop. By default, computations involving variables that require gradients will keep history. This means that you should avoid using such variables in computations which will live beyond your training loops, e.g., when tracking statistics. Instead, you should detach the variable or access its underlying data.
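
In other words, if a loss tensor that still tracks gradients is kept around (for example, appended to a list or summed into a running total), its whole computation graph stays in GPU memory. A minimal illustration of the difference (model, loss_fn, and loader are placeholder names):

# Problematic: each `loss` keeps its computation graph, so memory grows per batch
running_loss = 0.0
for xb, yb in loader:
    loss = loss_fn(model(xb), yb)
    running_loss += loss             # tensor with history attached

# Fine: only a Python float is kept
running_loss = 0.0
for xb, yb in loader:
    loss = loss_fn(model(xb), yb)
    running_loss += loss.item()      # or loss.detach()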

In my case, wrapping my self.batch_training(train=False) call in a with torch.no_grad(): block resolved the issue.
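
Concretely, the call from the question can be wrapped like this (a sketch; torch.no_grad() simply disables graph construction inside the block):

self.batch_training(train=True)            # training: gradients are needed

with torch.no_grad():                      # no computation graph is built
    self.batch_training(train=False)       # evaluation: compute losses only

On recent PyTorch versions, torch.inference_mode() can be used the same way.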

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: cxu