Shall I use grad.zero_() in PyTorch with or without gradient tracking?
I'm quite new to PyTorch, and I have a question about zeroing the gradients after an epoch. Suppose I have the following training loop:
```python
for epoch in range(n_iters):
    y_hat = forward(X)
    l = loss(y, y_hat)
    with torch.no_grad():
        l.backward()
        w -= lr * w.grad
```
It is clear that, in order not to have the gradients accumulate, I need to zero the .grad attribute of w. However, I'm unsure where to call w.grad.zero_(). In tutorials on the internet I have found it called both inside the no_grad() block and outside of it. I tested both variants and they worked fine for a simple linear regression.
Is there any difference between the two? If there is, which one is better to use?
Solution 1:[1]
In your snippet it doesn't really matter. The trailing underscore in zero_() means it is an in-place function, and since w.grad.requires_grad == False, no gradient computation with respect to w.grad will happen anyway. The only important thing is that the zeroing happens before the next loss.backward() call, so that the fresh gradients are not added on top of the old ones.
I would recommend, though, using different names for your loss function and the actual loss tensor it computes; otherwise you're overwriting one with the other.
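To make both points concrete, here is a minimal runnable sketch of the loop with the loss function and the loss tensor under different names, and with w.grad.zero_() called inside the no_grad() block. The data, learning rate, and epoch count are illustrative assumptions, not values from the question; since w.grad.requires_grad is False, the zero_() call could equally well be moved outside the with block.

```python
import torch

# Illustrative data (not from the question): y = 3 * x, so the true weight is 3.
torch.manual_seed(0)
X = torch.randn(100, 1)
y = 3.0 * X

w = torch.zeros(1, requires_grad=True)
loss_fn = torch.nn.MSELoss()  # function and its result kept under different names
lr = 0.1

for epoch in range(50):
    y_hat = X * w
    l = loss_fn(y_hat, y)   # the loss tensor, distinct from loss_fn
    l.backward()            # accumulates into w.grad
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()      # reset before the next backward(); placement
                            # inside or outside no_grad() makes no difference

print(w.item())  # converges towards 3.0
```

Either placement gives identical results; what matters is that the gradient is cleared between one backward() call and the next.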
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | flawr |
