Manually/explicitly calculate gradients of Conv kernels

1. Background:

I can calculate the gradient of a cost function with respect to x in two ways: (1) writing out the explicit analytic formula by hand, and (2) using the torch.autograd package. Here is my example:

import torch
import torch.nn.functional as F

for i in range(10):
    x = torch.randn(8, 1, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)

    loss = lambda z: 0.5 * (F.conv2d(z, k, stride=32) - y).pow(2).sum(dim=[1,2,3])  # cost function is [(1/2)||k*x-y||_F^2]

    # 1: calculate gradient of x explicitly and manually
    x_grad_manual = F.conv2d(x, k, stride=32) - y
    x_grad_manual = F.conv_transpose2d(x_grad_manual, k, stride=32)

    # 2: calculate gradient of x using torch.autograd
    x_var = torch.autograd.Variable(x, requires_grad=True)
    x_var_loss = loss(x_var)
    x_grad_auto = torch.autograd.grad(x_var_loss, x_var, torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)[0]

    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())

Since the mean squared error between the results of the two implementations is very small (about 3.4*10^(-8)), I think they match and the manual implementation works correctly.
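
The manual formula works because conv_transpose2d with the same kernel and stride is the adjoint (transpose) of conv2d, so the gradient of (1/2)||k*x-y||_F^2 is that adjoint applied to the residual k*x-y. A minimal sketch of a dot-product check for this property, assuming small illustrative tensor sizes (not the ones used above) and float64 for a tight comparison:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Small, hypothetical sizes chosen only to keep the check fast.
x = torch.randn(2, 1, 16, 16, dtype=torch.float64)
k = torch.randn(5, 1, 4, 4, dtype=torch.float64)
r = torch.randn(2, 5, 4, 4, dtype=torch.float64)  # same shape as conv2d(x, k, stride=4)

# Adjoint identity: <K x, r> == <x, K^T r>
lhs = (F.conv2d(x, k, stride=4) * r).sum()
rhs = (x * F.conv_transpose2d(r, k, stride=4)).sum()

adjoint_gap = (lhs - rhs).abs().item()
print(adjoint_gap)  # expected to be at floating-point round-off level
```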

2. My Problem

I am confused about how to explicitly write out the gradients of variables (features and Conv kernels) when the forward pass is a composition of several operations. For instance, I do not know how to calculate the gradients with respect to the feature x and the Conv kernel w1 in the following context:

import torch
import torch.nn.functional as F

for i in range(10):
    x = torch.randn(8, 32, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    w1 = torch.randn(1, 32, 3, 3)

    def loss(z, w):
        z_forward = F.conv2d(z, w, padding=1)  # z = w1 * x
        return 0.5 * (F.conv2d(z_forward, k, stride=32) - y).pow(2).sum(dim=[1,2,3])  # cost function is [(1/2)||k*z-y||_F^2]

    # 1: calculate gradients of x and w1 explicitly and manually
    x_grad_manual = ???
    w1_grad_manual = ???

    # 2: calculate gradients of x and w1 using torch.autograd
    x_var = torch.autograd.Variable(x, requires_grad=True)
    w1_var = torch.autograd.Variable(w1, requires_grad=True)
    x_var_loss = loss(x_var, w1_var)
    x_grad_auto, w1_grad_auto = torch.autograd.grad(x_var_loss, [x_var, w1_var], torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)

    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
    print((w1_grad_manual - w1_grad_auto).pow(2).mean())
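
One possible way to fill in the two ??? placeholders is to apply the chain rule layer by layer: push the residual back through k with conv_transpose2d, then through w1. The sketch below is not a definitive implementation. It assumes reduced, illustrative sizes (16x16 inputs, a 4x4 kernel k with stride 4) and float64, and conv_weight_grad is a hypothetical helper name that writes the kernel gradient of a stride-1 convolution as another conv2d with batch and channel axes swapped.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Reduced, hypothetical sizes (the question uses 128x128 inputs and a 32x32 kernel).
x = torch.randn(2, 3, 16, 16, dtype=torch.float64)
y = torch.randn(2, 5, 4, 4, dtype=torch.float64)
k = torch.randn(5, 1, 4, 4, dtype=torch.float64)
w1 = torch.randn(1, 3, 3, 3, dtype=torch.float64)

def conv_weight_grad(inp, grad_out, padding):
    # Hypothetical helper: gradient of a stride-1 conv2d w.r.t. its weight.
    # Swapping batch and channel axes makes conv2d sum over the batch and
    # over all spatial positions, which is exactly the kernel gradient.
    out = F.conv2d(inp.transpose(0, 1), grad_out.transpose(0, 1), padding=padding)
    return out.transpose(0, 1)

# Forward pass, keeping the intermediate feature z = w1 * x
z = F.conv2d(x, w1, padding=1)
r = F.conv2d(z, k, stride=4) - y                 # residual k*z - y

# Backward pass written out explicitly
z_grad = F.conv_transpose2d(r, k, stride=4)      # d loss / d z
x_grad_manual = F.conv_transpose2d(z_grad, w1, padding=1)
w1_grad_manual = conv_weight_grad(x, z_grad, padding=1)

# Reference gradients from torch.autograd
x_var = x.clone().requires_grad_(True)
w1_var = w1.clone().requires_grad_(True)
loss = 0.5 * (F.conv2d(F.conv2d(x_var, w1_var, padding=1), k, stride=4) - y).pow(2).sum()
x_grad_auto, w1_grad_auto = torch.autograd.grad(loss, [x_var, w1_var])

x_err = (x_grad_manual - x_grad_auto).pow(2).mean().item()
w1_err = (w1_grad_manual - w1_grad_auto).pow(2).mean().item()
print(x_err, w1_err)
```

Every step is an ordinary conv2d or conv_transpose2d call, so the manual gradients remain differentiable with respect to x, w1, and k.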

3. Extension:

Furthermore, if the forward pass is more complicated than the one above, with two intermediate Conv layers and a ReLU activation, how can I write out the gradients? Please see the following problem:

import torch
import torch.nn.functional as F

for i in range(10):
    x = torch.randn(8, 32, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    w1 = torch.randn(32, 32, 3, 3)
    w2 = torch.randn(1, 32, 3, 3)

    def loss(z, q1, q2):
        z_forward = F.conv2d(z, q1, padding=1)  # z = w1 * x
        z_forward = F.relu(z_forward, inplace=True)  # z = ReLU(w1 * x)
        z_forward = F.conv2d(z_forward, q2, padding=1)  # z = w2 * ReLU(w1 * x)
        return 0.5 * (F.conv2d(z_forward, k, stride=32) - y).pow(2).sum(dim=[1,2,3])  # cost function is [(1/2)||k*z-y||_F^2]

    # 1: calculate gradients of x, w1 and w2 explicitly and manually
    x_grad_manual = ???
    w1_grad_manual = ???
    w2_grad_manual = ???

    # 2: calculate gradients of x, w1 and w2 using torch.autograd
    x_var = torch.autograd.Variable(x, requires_grad=True)
    w1_var = torch.autograd.Variable(w1, requires_grad=True)
    w2_var = torch.autograd.Variable(w2, requires_grad=True)
    x_var_loss = loss(x_var, w1_var, w2_var)
    x_grad_auto, w1_grad_auto, w2_grad_auto = torch.autograd.grad(x_var_loss, [x_var, w1_var, w2_var], torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)

    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
    print((w1_grad_manual - w1_grad_auto).pow(2).mean())
    print((w2_grad_manual - w2_grad_auto).pow(2).mean())
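
A hedged sketch of how the three ??? placeholders might be filled in for this deeper case: the same layer-by-layer chain rule applies, and the ReLU contributes an elementwise mask (1 where its input is positive, 0 elsewhere). The sizes below are reduced and illustrative, float64 is used for a tight comparison, and conv_weight_grad is a hypothetical helper name, not a PyTorch function.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Reduced, hypothetical sizes (the question uses 32 channels and 128x128 inputs).
x = torch.randn(2, 3, 16, 16, dtype=torch.float64)
y = torch.randn(2, 5, 4, 4, dtype=torch.float64)
k = torch.randn(5, 1, 4, 4, dtype=torch.float64)
w1 = torch.randn(4, 3, 3, 3, dtype=torch.float64)
w2 = torch.randn(1, 4, 3, 3, dtype=torch.float64)

def conv_weight_grad(inp, grad_out, padding):
    # Hypothetical helper: kernel gradient of a stride-1 conv2d, expressed
    # as a conv2d with batch and channel axes swapped.
    out = F.conv2d(inp.transpose(0, 1), grad_out.transpose(0, 1), padding=padding)
    return out.transpose(0, 1)

# Forward pass, keeping every intermediate needed by the backward pass
a1 = F.conv2d(x, w1, padding=1)        # pre-activation
h = F.relu(a1)                         # ReLU(w1 * x)
z = F.conv2d(h, w2, padding=1)         # w2 * ReLU(w1 * x)
r = F.conv2d(z, k, stride=4) - y       # residual

# Backward pass, one layer at a time
z_grad = F.conv_transpose2d(r, k, stride=4)
w2_grad_manual = conv_weight_grad(h, z_grad, padding=1)
h_grad = F.conv_transpose2d(z_grad, w2, padding=1)
a1_grad = h_grad * (a1 > 0).to(a1.dtype)           # ReLU derivative as a mask
w1_grad_manual = conv_weight_grad(x, a1_grad, padding=1)
x_grad_manual = F.conv_transpose2d(a1_grad, w1, padding=1)

# Reference gradients from torch.autograd
x_var = x.clone().requires_grad_(True)
w1_var = w1.clone().requires_grad_(True)
w2_var = w2.clone().requires_grad_(True)
z_auto = F.conv2d(F.relu(F.conv2d(x_var, w1_var, padding=1)), w2_var, padding=1)
loss = 0.5 * (F.conv2d(z_auto, k, stride=4) - y).pow(2).sum()
x_grad_auto, w1_grad_auto, w2_grad_auto = torch.autograd.grad(loss, [x_var, w1_var, w2_var])

errs = [
    (x_grad_manual - x_grad_auto).pow(2).mean().item(),
    (w1_grad_manual - w1_grad_auto).pow(2).mean().item(),
    (w2_grad_manual - w2_grad_auto).pow(2).mean().item(),
]
print(errs)
```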

4. Guarantee of Differentiability

As in my first example, I would like the manual gradient calculations to be fully explicit and themselves differentiable, so that I can insert them into my neural network implementation. Could you please show me how to achieve this?
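
To illustrate what "themselves differentiable" means here, the sketch below takes the manual gradient from the first example, uses it in one unrolled gradient step, and backpropagates an outer loss through that step to the kernel k. The sizes, the step size 0.1, and the outer loss are illustrative assumptions only. The result is compared with the same computation done via torch.autograd.grad with create_graph=True.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Small, hypothetical sizes; k is treated as a learnable parameter here.
x = torch.randn(2, 1, 16, 16, dtype=torch.float64)
y = torch.randn(2, 5, 4, 4, dtype=torch.float64)
k = torch.randn(5, 1, 4, 4, dtype=torch.float64, requires_grad=True)

# Manual inner gradient, built only from differentiable operations
x_grad_manual = F.conv_transpose2d(F.conv2d(x, k, stride=4) - y, k, stride=4)
outer_manual = (x - 0.1 * x_grad_manual).pow(2).mean()   # illustrative outer loss
(k_grad_manual,) = torch.autograd.grad(outer_manual, k)

# Same computation with the inner gradient taken by autograd
x_var = x.clone().requires_grad_(True)
inner = 0.5 * (F.conv2d(x_var, k, stride=4) - y).pow(2).sum()
(x_grad_auto,) = torch.autograd.grad(inner, x_var, create_graph=True)
outer_auto = (x_var - 0.1 * x_grad_auto).pow(2).mean()
(k_grad_auto,) = torch.autograd.grad(outer_auto, k)

k_grad_gap = (k_grad_manual - k_grad_auto).abs().max().item()
print(k_grad_gap)  # expected to be at floating-point round-off level
```

In the manual version no inner call to torch.autograd.grad is needed, so the whole computation lives in a single graph.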

5. The Reason for Posting This Problem

In a neural network I constructed, I need to calculate the gradients of my pre-defined cost functions (as shown above) with respect to some features and Conv kernels. In my current implementation, I directly use the torch.autograd package to calculate these gradients. However, it seems that some errors accumulate and mislead the learning process when I train the network.

(The whole neural network has its own loss function and backward process. I just added some extra inner gradient calculations to achieve my goals.)

I conjecture that I should calculate the gradients manually rather than call torch.autograd directly inside the network's forward pass, since some computational graphs and backward passes may be nested and lead to wrong weight updates.

In my experiments, I trained two networks (with manual and automatic gradient calculations, as in the first example) and got similar results. But when I extend this to more complicated forward passes (like the two problems posted above), training becomes unstable. So I want to write out the gradients manually to avoid implementation mistakes and conduct more experiments.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow