CUDA memory issues in PyTorch
I'm trying to fine-tune BERT for a classification task in Google Colab, and I often get a CUDA out-of-memory error. The behaviour is a bit strange:
- If I run my training loop with a batch size of, say, 20, it may initially be fine for a while. But if I stop the code and run it again, I get CUDA out-of-memory errors, even after reducing the batch size to 1. If I restart the runtime (or do a factory reset), it runs fine again. What causes memory to accumulate across runs? I don't think I'm storing anything, so the only thing that should matter is the batch size.
- My Colab session also often crashes due to RAM after multiple runs, but works fine for a while after restarting the runtime.
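A common cause of this pattern in Colab is that re-running a cell leaves the previous run's `model`, `optimizer`, and loss tensors alive in the notebook's globals (and in IPython's output cache), so their GPU memory is never released, and PyTorch's caching allocator holds on to freed blocks as well. A minimal cleanup sketch between runs (the `del` names are the ones from the training code; adjust to your own) might look like this:

```python
import gc
import torch

def free_cuda_memory():
    """Release cached CUDA blocks once stale references are gone."""
    gc.collect()                      # collect reference cycles that still pin tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached memory to the CUDA driver

# Between runs, first drop your own references, e.g.:
# del model, optimizer
free_cuda_memory()   # safe no-op on a CPU-only machine
```

Note that `torch.cuda.empty_cache()` only returns memory that is no longer referenced by any tensor, so dropping the stale references first is the essential step.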
Here is the code I'm using for training. The class Model is my model, consisting of pre-trained BERT (from the Hugging Face transformers library) with a few layers on top.
```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Model().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.BCELoss().to(device)

batch_size = 32
# data is a subclass of torch.utils.data.Dataset
train_data_loader = torch.utils.data.DataLoader(data, batch_size=batch_size)

def train(n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0
        for embeddings, labels in train_data_loader:
            labels = labels.double().to(device)
            input_ids = embeddings['input_ids'].squeeze(1).to(device)
            attention_mask = embeddings['attention_mask'].squeeze(1).to(device)

            output = model(input_ids, attention_mask)
            loss = criterion(output, labels)
            total_loss += loss.item()

            model.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'epoch {epoch+1}, total loss = {total_loss}')

total_epochs = 10
train(total_epochs)
```
Is there anything wrong or inefficient in my training loop that could be causing the error?
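For reference, here is a CPU-runnable sketch of the same loop structure with one small memory-hygiene tweak: `optimizer.zero_grad(set_to_none=True)` frees the gradient buffers instead of filling them with zeros (the loop above already does the other important thing, accumulating `loss.item()` as a plain Python float so the autograd graph is not retained across iterations). The tiny `nn.Sequential` model and random data are stand-ins, not the actual BERT setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in model and data so the loop runs anywhere; same loop shape as the question.
model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid()).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.BCELoss()

inputs = torch.randn(64, 8)
labels = torch.randint(0, 2, (64, 1)).float()
loader = DataLoader(TensorDataset(inputs, labels), batch_size=16)

def train(n_epochs):
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad(set_to_none=True)  # drop grad buffers rather than zeroing
            loss = criterion(model(x), y)
            total_loss += loss.item()              # .item() detaches; graph is freed each step
            loss.backward()
            optimizer.step()
        print(f'epoch {epoch + 1}, total loss = {total_loss}')

train(1)
```

Neither change fixes a correctness bug in the posted loop; they just reduce peak memory per step, which is why the across-run growth most likely comes from stale references in the notebook rather than from the loop itself.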
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
