Is this backpropagation actually correct?
I was watching this tutorial on how to implement a simple vanilla RNN; the code can be found here.
It was all going well until he started implementing backpropagation.
What I know about backpropagation in RNNs is that the derivative of the loss with respect to the hidden-to-hidden weight matrix (∂L(t)/∂W) at a time step t depends on all of the previous time steps (t, t-1, t-2, ..., 2, 1).
A slide from this lecture:
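I can't reproduce the slide here, but the dependency I mean is (roughly, in my own notation, not the slide's) the standard BPTT chain-rule expansion for the hidden-to-hidden weights:

\[
\frac{\partial L^{(t)}}{\partial W_{hh}}
  = \sum_{k=1}^{t}
    \frac{\partial L^{(t)}}{\partial y^{(t)}}\,
    \frac{\partial y^{(t)}}{\partial h^{(t)}}
    \left(\prod_{j=k+1}^{t} \frac{\partial h^{(j)}}{\partial h^{(j-1)}}\right)
    \frac{\partial h^{(k)}}{\partial W_{hh}}
\]

i.e. the gradient of the loss at time step t gathers a contribution from every earlier time step k ≤ t.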

This is not being done in the tutorial. I took a look at some other sources (like here) and found a couple that are doing the same thing.
This is how it's being done:
def lossFun(inputs, targets, hprev):
    """
    inputs, targets are both lists of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    # Store our inputs, hidden states, outputs, and probability values.
    # Each of these is a SEQ_LENGTH (here 25) long dict, i.e. 1 vector per time (seq) step:
    # xs will store 1-hot encoded input characters for each of the 25 time steps (26, 25 times)
    # hs will store hidden state outputs for the 25 time steps (100, 25 times), plus a -1 indexed
    #   initial state used to calculate the hidden state at t = 0
    # ys will store the unnormalized output scores (logits) for the 25 time steps (26, 25 times)
    # ps will take the ys and convert them to normalized probabilities over chars
    # We could have used lists BUT we need an entry with index -1 to calc the 0th hidden state;
    # -1 as a list index would wrap around to the final element.
    xs, hs, ys, ps = {}, {}, {}, {}
    # Init with previous hidden state.
    # Using "=" would create a reference; np.copy creates a whole separate copy.
    # We don't want hs[-1] to automatically change if hprev is changed.
    hs[-1] = np.copy(hprev)
    # init loss as 0
    loss = 0
    # forward pass
    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1)) # encode in 1-of-k representation (a zero vector as the t-th input)
        xs[t][inputs[t]] = 1 # use the integer in the "inputs" list to set the correct entry to 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0]) # softmax (cross-entropy loss)
    # backward pass: compute gradients going backwards
    # initialize accumulators for the gradients of each set of weights
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(xrange(len(inputs))):
        # output probabilities
        dy = np.copy(ps[t])
        # derive our first gradient
        dy[targets[t]] -= 1 # backprop into y
        # compute output gradient - output times hidden states transpose
        # When we apply the transpose weight matrix, we can think intuitively of this
        # as moving the error backward through the network, giving us some sort of
        # measure of the error at the output of the l-th layer.
        # output gradient
        dWhy += np.dot(dy, hs[t].T)
        # derivative of output bias
        dby += dy
        # backpropagate!
        dh = np.dot(Why.T, dy) + dhnext # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
        dbh += dhraw # derivative of hidden bias
        dWxh += np.dot(dhraw, xs[t].T) # derivative of input-to-hidden weights
        dWhh += np.dot(dhraw, hs[t-1].T) # derivative of hidden-to-hidden weights
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
More specifically, this loop is what I'm concerned with:
for t in reversed(xrange(len(inputs))):
    # output probabilities
    dy = np.copy(ps[t])
    # derive our first gradient
    dy[targets[t]] -= 1 # backprop into y
    # compute output gradient - output times hidden states transpose
    # When we apply the transpose weight matrix, we can think intuitively of this
    # as moving the error backward through the network, giving us some sort of
    # measure of the error at the output of the l-th layer.
    # output gradient
    dWhy += np.dot(dy, hs[t].T)
    # derivative of output bias
    dby += dy
    # backpropagate!
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw # derivative of hidden bias
    dWxh += np.dot(dhraw, xs[t].T) # derivative of input-to-hidden weights
    dWhh += np.dot(dhraw, hs[t-1].T) # derivative of hidden-to-hidden weights
    dhnext = np.dot(Whh.T, dhraw)
Am I missing something here?
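For reference, a minimal finite-difference check along these lines could be used to compare lossFun's analytic gradients with numerical estimates. This is my own sketch, not code from the tutorial: it assumes the tutorial's globals (Wxh, Whh, Why, bh, by) are already defined, and the names grad_check, delta and num_checks are mine.

def grad_check(inputs, targets, hprev, delta=1e-5, num_checks=5):
    # Analytic gradients from the tutorial's lossFun
    _, dWxh, dWhh, dWhy, dbh, dby, _ = lossFun(inputs, targets, hprev)
    for param, dparam, name in zip([Wxh, Whh, Why, bh, by],
                                   [dWxh, dWhh, dWhy, dbh, dby],
                                   ['Wxh', 'Whh', 'Why', 'bh', 'by']):
        for _ in range(num_checks):
            ix = np.random.randint(param.size)
            old = param.flat[ix]
            # Nudge one parameter in both directions and recompute the loss
            param.flat[ix] = old + delta
            loss_plus = lossFun(inputs, targets, hprev)[0]
            param.flat[ix] = old - delta
            loss_minus = lossFun(inputs, targets, hprev)[0]
            param.flat[ix] = old  # restore the original value
            grad_numerical = (loss_plus - loss_minus) / (2 * delta)
            grad_analytic = dparam.flat[ix]
            denom = abs(grad_numerical) + abs(grad_analytic)
            rel_error = abs(grad_analytic - grad_numerical) / denom if denom > 0 else 0.0
            # Note: lossFun clips gradients to [-5, 5], so entries at the clip
            # boundary will not match the numerical estimate exactly.
            print('%s: analytic %f, numerical %f, relative error %e'
                  % (name, grad_analytic, grad_numerical, rel_error))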
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow