Model optimization inconsistency in PyTorch Lightning

(I am using PyTorch Lightning.) I am trying to build a pointer network with a transformer, meaning that the model chooses tokens from its input as its output. The transformerPointer takes an input of shape (batch_size, src_length) and a target output of shape (batch_size, tgt_length), then computes attention scores of shape (batch_size, tgt_length, src_length), where:

#  tgt_length : length of model output 
#  src_length : length of model input

Since this is a pointer network, the attention scores' last dimension has size src_length, representing the distribution of the correct output over all the possible words chosen from the given input (at this stage the scores have not been softmaxed yet).

Since we need to predict tgt_length words, it takes tgt_length such distributions to produce the entire output; hence the second dimension has size tgt_length.
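To make the shapes concrete, a minimal sketch with made-up sizes (the numbers here are illustrative only):

import torch
import torch.nn.functional as F

batch_size, tgt_length, src_length = 2, 2, 3

# Raw (pre-softmax) attention scores: one score per input position,
# for each of the tgt_length output steps.
attn_scores = torch.randn(batch_size, tgt_length, src_length)

# Softmaxing the last dimension turns each step's scores into a
# probability distribution over the src_length input positions.
probs = F.softmax(attn_scores, dim=-1)
print(probs.shape)        # torch.Size([2, 2, 3])
print(probs.sum(dim=-1))  # all ones: each step is a valid distribution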


Now I wish to take the attention scores of shape (batch_size, tgt_length, src_length) and turn them into shape (batch_size, tgt_length, src_vocab), in order to obtain a probability distribution over the entire vocabulary instead of just the words in the given input (src_vocab is the size of the vocabulary).
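The mapping relies on torch.Tensor.scatter_, which writes each position's score into the vocabulary slot of the token that appeared at that position. A small standalone example of its semantics, with made-up values:

import torch

src_vocab = 6
src = torch.tensor([[1, 3, 4]])            # input token ids (batch=1, src_length=3)
scores = torch.tensor([[0.2, 0.5, 0.3]])   # per-position scores (batch=1, src_length=3)

# Start every vocabulary slot at a large negative "background" value,
# then write each position's score into the slot of its token id.
a = torch.full((1, src_vocab), -1e9)
a.scatter_(1, src, scores)
print(a)  # [[-1e9, 0.2, -1e9, 0.5, 0.3, -1e9]]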

Within the full model, this transformation is carried out in the code below:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class Generator(pl.LightningModule):
    def __init__(self):
        super(Generator, self).__init__()

    def forward(self, x, src, src_vocab):
        # x: attention scores so far, (batch_size, tgt_length, src_length)
        # src: model input token ids, (batch_size, src_length)
        # src_vocab: size of the vocabulary (a scalar)
        output = torch.tensor([])
        for count in range(0, x.size(1)):
            # map one decoding step's scores onto the vocabulary axis
            a = self.one_hot_format(x[:, count, :], src, src_vocab)
            output = torch.cat((output, a), dim=-2)
        return output
        # return shape (batch_size, tgt_length, src_vocab)

    def one_hot_format(self, x, src, src_vocab):
        # x (batch_size, src_length)
        # src (batch_size, src_length)
        # src_vocab (a scalar value)
        a = torch.zeros(x.size(0), src_vocab).fill_(-1e9)
        x = F.log_softmax(x, dim=-1)  # explicit dim avoids the implicit-dim deprecation warning
        return a.scatter_(1, src.long(), x).unsqueeze(-2)
        # return shape (batch_size, 1, src_vocab)

gen = Generator()
print(gen(x, src, src_vocab))

The code above works fine, and the model trains successfully. (I compute the loss from the output of this generator and the one-hot encoded version of the target values.)
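The exact loss code isn't included here; a plausible sketch of such a loss, assuming a negative log-likelihood between the generator's log-probabilities and a one-hot target (one_hot_nll is an illustrative helper name, not the actual code):

import torch
import torch.nn.functional as F

def one_hot_nll(log_probs, tgt, src_vocab):
    # log_probs: (batch_size, tgt_length, src_vocab), log-probabilities
    # tgt: (batch_size, tgt_length), integer token ids
    one_hot = F.one_hot(tgt, num_classes=src_vocab).float()
    # Cross-entropy against a one-hot target picks out the log-probability
    # assigned to the correct token at each output step.
    return -(one_hot * log_probs).sum(dim=-1).mean()

However, when I changed the code a tiny bit, as seen below, the model was no longer able to train: the training loss would not decrease.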


class GeneratorGiveError(pl.LightningModule):
    def __init__(self):
        super(GeneratorGiveError, self).__init__()

    def forward(self, x, src, src_vocab):
        # x (batch_size, tgt_length, src_length)
        output = torch.tensor([])
        for count in range(0, x.size(1)):
            a = self.one_hot_format(x[:, count, :], src, src_vocab)
            output = torch.cat((output, a), dim=-2)
        output = F.log_softmax(output, dim=-1)                 # this line is added
        return output
        # return shape (batch_size, tgt_length, src_vocab)

    def one_hot_format(self, x, src, src_vocab):
        # x (batch_size, src_length)
        # src (batch_size, src_length)
        # src_vocab (a scalar value)
        a = torch.zeros(x.size(0), src_vocab).fill_(-1e9)
        # x = F.log_softmax(x, dim=-1)                         # this line is removed
        return a.scatter_(1, src.long(), x).unsqueeze(-2)
        # return shape (batch_size, 1, src_vocab)

The two classes defined above do exactly the same thing in the forward pass; as proof, see the code below.

gen = Generator()
gen1 = GeneratorGiveError()
a = torch.FloatTensor([[[10, 12, 13], [13, 14, 15]], [[5, 6, 7], [8, 5, 1]]])
t = torch.FloatTensor([[1, 3, 4], [2, 4, 5]])
print(gen(a, t, 6))
print(gen1(a, t, 6))

Output:

tensor([[[-1.0000e+09, -3.3490e+00, -1.0000e+09, -1.3490e+00, -3.4901e-01,
          -1.0000e+09],
         [-1.0000e+09, -2.4076e+00, -1.0000e+09, -1.4076e+00, -4.0761e-01,
          -1.0000e+09]],

        [[-1.0000e+09, -1.0000e+09, -2.4076e+00, -1.0000e+09, -1.4076e+00,
          -4.0761e-01],
         [-1.0000e+09, -1.0000e+09, -4.9456e-02, -1.0000e+09, -3.0495e+00,
          -7.0495e+00]]])
tensor([[[-1.0000e+09, -3.3490e+00, -1.0000e+09, -1.3490e+00, -3.4901e-01,
          -1.0000e+09],
         [-1.0000e+09, -2.4076e+00, -1.0000e+09, -1.4076e+00, -4.0761e-01,
          -1.0000e+09]],

        [[-1.0000e+09, -1.0000e+09, -2.4076e+00, -1.0000e+09, -1.4076e+00,
          -4.0761e-01],
         [-1.0000e+09, -1.0000e+09, -4.9456e-02, -1.0000e+09, -3.0495e+00,
          -7.0495e+00]]])
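A quick numerical check confirms that the two forward outputs match to floating-point tolerance:

print(torch.allclose(gen(a, t, 6), gen1(a, t, 6)))  # True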

Both do exactly the same thing in the forward pass, yet the latter does not allow the model to improve, and I don't really understand why. I assume it has something to do with how backpropagation works in pytorch_lightning? I am very confused.
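One way to make the question sharper is to compare the gradients of the two formulations directly, since the forward values agree; a diagnostic sketch for a single decoding step (sizes, token ids, and the NLL-style loss are made up, and out-of-place scatter is used to keep the two autograd graphs independent):

import torch
import torch.nn.functional as F

src_vocab = 6
src = torch.tensor([[1, 3, 4], [2, 4, 5]])   # (batch_size, src_length) token ids
tgt = torch.tensor([3, 4])                   # assumed correct token per example
x = torch.randn(2, 3, requires_grad=True)    # raw attention scores for one step

background = torch.full((2, src_vocab), -1e9)

# Version 1: log-softmax over the input positions, then scatter into vocab slots.
y1 = background.scatter(1, src, F.log_softmax(x, dim=-1))

# Version 2: scatter the raw scores first, then log-softmax over the vocabulary.
y2 = F.log_softmax(background.scatter(1, src, x), dim=-1)

# NLL-style loss: negative log-probability of the assumed correct token.
loss1 = -y1[torch.arange(2), tgt].mean()
loss2 = -y2[torch.arange(2), tgt].mean()

g1, = torch.autograd.grad(loss1, x)
g2, = torch.autograd.grad(loss2, x)

print((y1 - y2).abs().max())  # forward values: nearly identical
print((g1 - g2).abs().max())  # the gradients reaching the attention scores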



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
