A2C does not converge as the loss explodes

I'm experimenting with the Advantage Actor-Critic (A2C) algorithm, and the loss explodes exponentially.

For example:

    iteration    actor_loss    critic_loss
    17           -0.072878     0.003239
    78           -256202.2     254041.0
    428          -1.02e+17     7.17e+16

The actor network and critic network share the same base network, with separate head layers on top.

I've checked that the predicted value is also exploding, toward negative values of roughly the same order of magnitude as the losses.
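
Roughly, the model is structured like this (the layer sizes and names here are placeholders, not my exact code):

    import torch.nn as nn

    class ActorCritic(nn.Module):
        """Shared base network with separate actor and critic heads."""
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            # base network shared by both heads
            self.base = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            # separate layers on top of the shared base
            self.actor_head = nn.Linear(hidden, n_actions)   # action logits
            self.critic_head = nn.Linear(hidden, 1)          # state value

        def forward(self, obs):
            h = self.base(obs)
            return self.actor_head(h), self.critic_head(h)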

The update step looks like this:

    import torch
    import torch.nn.functional as F

    policy_losses = []
    value_losses = []

    for log_prob, value, R in zip(log_prob_list, value_list, returns):
        advantage = R - value

        # actor loss
        policy = -log_prob * advantage
        policy_losses.append(policy.mean())

        # critic loss
        value_losses.append(F.smooth_l1_loss(value, R))

    loss1 = torch.stack(policy_losses).sum()   # total actor loss
    loss2 = torch.stack(value_losses).sum()    # total critic loss

    # combined objective: actor loss + weighted critic loss - entropy bonus
    loss = loss1 + beta * loss2 - 0.001 * total_entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
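
For reference, returns holds the discounted return for each step of the rollout; it is computed in the usual way, roughly like this (the helper name and the gamma value are just for illustration):

    def discounted_returns(rewards, gamma=0.99):
        """Standard discounted return target for each step."""
        returns, R = [], 0.0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)
        return returns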

I'm wondering what's wrong. In a way the explosion seems self-consistent: the critic outputs a value close to negative infinity, which pushes the advantage toward positive infinity, which in turn drives policy_loss toward negative infinity.
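
To make that concrete, here is a tiny standalone sketch (the log-probability and value estimates are made-up numbers, not taken from my runs) showing how both loss terms grow roughly linearly with the magnitude of the predicted value once it diverges:

    import torch
    import torch.nn.functional as F

    log_prob = torch.tensor(-0.5)   # hypothetical log-probability of the chosen action
    R = torch.tensor(1.0)           # hypothetical return for that step

    for v in (0.0, -1e3, -1e6):
        value = torch.tensor(v)
        advantage = R - value                      # blows up as the value estimate diverges
        policy_term = -log_prob * advantage        # actor loss term tracks the advantage
        critic_term = F.smooth_l1_loss(value, R)   # roughly |value - R| for large errors
        print(v, policy_term.item(), critic_term.item())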

I've tried increasing beta, even up to 1e4, but that didn't help.

Can anyone help me correct this?


