How do I interpret a quickly converging Q loss/value function loss in actor-critic?
I am researching an application of actor-critic RL in a nonstationary environment, and the loss of the Q-network (or, when I also implement a state-value network, that loss too) quickly converges to zero, well before the agent finds the optimal policy.
The architecture is reasonably successful at finding a good policy, although it is not very robust to perturbations, and I suspect that the Q-loss converging this fast indicates that the critic is failing to estimate the action-value or state-value function correctly. The nonstationarity of the environment makes it even more suspect, since there should always be some degree of error in the estimates. Any ideas as to what might be causing this?
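To make the symptom concrete: the Q-loss is measured against a bootstrapped target built from the networks themselves, so it can shrink to zero even when the value estimates are off, for example if the critic collapses towards a nearly constant output. This is the kind of check I have in mind (just a sketch, not part of my training code, reusing the variable names from the training snippet further down):

# Sketch of a diagnostic: compare the size of the TD residual with the spread
# of the Q predictions over a sampled batch. A tiny residual together with a
# tiny spread would suggest the critic has collapsed to an (almost) constant
# function rather than learned accurate values.
with torch.no_grad():
    q_pred = net.q1(o, a)
    td_residual = (q_pred - backup).abs().mean().item()
    q_spread = q_pred.std().item()
print(f"mean |TD residual| = {td_residual:.4f}, std of Q predictions = {q_spread:.4f}")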
Specifically, I am using soft actor-critic, and my implementation is based on OpenAI's Spinning Up repo. The optimization targets are as described in the paper [0], but I honestly find their code much easier to follow - math in RL papers is usually not rigorous enough to really make sense of on its own. Anyway, this is the expression for the value function target:

$$ J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \tfrac{1}{2} \left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi} \left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right] $$

and for the Q-function target:

$$ J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \tfrac{1}{2} \left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \qquad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V_{\bar{\psi}}(s_{t+1}) \right] $$

where $\theta$, $\psi$ and $\bar{\psi}$ parameterize the Q-function, the main value network and the target value network, respectively, and $\phi$ parameterizes the policy. I slightly modify these targets to optimize for the average reward rate, since my task is continuing (see [3]), and include entropy regularization via the log probability of the action given by the policy.
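Concretely, the backup my Q-networks regress onto ends up looking like this (just restating what the code further down computes; $\bar{r}$ is the running estimate of the average reward rate, $\alpha$ the entropy temperature, and $\bar{\theta}_1, \bar{\theta}_2$ the target Q-networks):

$$ \hat{Q}(s_t, a_t) = r(s_t, a_t) - \bar{r} + \mathbb{E}_{a_{t+1} \sim \pi_\phi} \left[ \min_{i=1,2} Q_{\bar{\theta}_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right] $$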
My Q- and value functions are simple MLPs:
# Soft Actor-Critic networks, adapted from OpenAI Spinning Up:
# https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/sac
import torch
import torch.nn as nn

def mlp(sizes, activation, output_activation=nn.Identity):
    # Build a feed-forward stack of Linear layers with the given activations.
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act()]
    return nn.Sequential(*layers)

class MLPQFunction(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_sizes, activation):
        super().__init__()
        self.q = mlp([obs_dim + act_dim] + list(hidden_sizes) + [1], activation)

    def forward(self, obs, act):
        # Q(s, a): concatenate observation and action along the last dimension.
        q = self.q(torch.cat([obs, act], dim=-1))
        return torch.squeeze(q, -1)  # Critical to ensure q has right shape.

class MLPValueFunction(nn.Module):
    def __init__(self, obs_dim, hidden_sizes, activation):
        super().__init__()
        self.v = mlp([obs_dim] + list(hidden_sizes) + [1], activation)

    def forward(self, obs):
        v = self.v(obs)
        return v.squeeze(-1)  # Squeeze only the last dim, matching the Q-function output.
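For concreteness, the critics are instantiated roughly like this (the hidden sizes here are placeholders, not my actual configuration; obs_dim = 2 and act_dim = 1 match the shapes described below):

# Illustrative instantiation only - hidden sizes are placeholders
obs_dim, act_dim = 2, 1
hidden_sizes = (256, 256)
q1_net = MLPQFunction(obs_dim, act_dim, hidden_sizes, nn.ReLU)
q2_net = MLPQFunction(obs_dim, act_dim, hidden_sizes, nn.ReLU)
v_net = MLPValueFunction(obs_dim, hidden_sizes, nn.ReLU)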
I compute the losses as follows, after sampling a tuple of batches (o, a, r, o2) from the replay buffer. Each tensor is [batch x dim(S)] if it is a batch of observations, where dim(S) is the dimension of the state space (2 in my case), or [batch x 1] if it is a batch of actions or rewards.
q1 = net.q1(o, a)
q2 = net.q2(o, a)

# Bellman backup for the Q-functions (F is torch.nn.functional)
with torch.no_grad():
    # Target actions come from the *current* policy
    a2, logp_a2 = net.pi(o2)

    # Target Q-values: minimum over the two target critics
    q1_pi_targ = target_net.q1(o2, a2)
    q2_pi_targ = target_net.q2(o2, a2)
    q_pi_targ = torch.min(q1_pi_targ, q2_pi_targ)

    # Average-reward (differential) backup with entropy regularization
    backup = r - avg_reward + (q_pi_targ - temp * logp_a2)

# Huber (smooth L1) loss against the Bellman backup
loss_q1 = F.smooth_l1_loss(q1, backup)
loss_q2 = F.smooth_l1_loss(q2, backup)
loss_q = loss_q1 + loss_q2

q_optimizer.zero_grad(set_to_none=True)
loss_q.backward()
q_optimizer.step()

# Compute the value function loss
v_optimizer.zero_grad(set_to_none=True)
vf = net.v(o)
with torch.no_grad():
    # q_pi and logp_pi come from a fresh policy sample at o (computed earlier,
    # not shown here), mirroring the value target above:
    # a_pi, logp_pi = net.pi(o); q_pi = torch.min(net.q1(o, a_pi), net.q2(o, a_pi))
    vf_target = q_pi - temp * logp_pi
loss_v = F.smooth_l1_loss(vf, vf_target)
loss_v.backward()
v_optimizer.step()
Where avg_reward is estimated as a running average:
avg_reward += AVG_REW_LR * (R - avg_reward + target_net.v(next_state.squeeze()) - target_net.v(state.squeeze()))
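For completeness, the target networks are moved towards the main networks by Polyak averaging after each gradient step, as in Spinning Up; a minimal sketch of that step (the coefficient here is Spinning Up's default, the value in my runs may differ):

with torch.no_grad():
    polyak = 0.995  # Spinning Up's default; illustrative here
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        # In-place Polyak update: p_targ <- polyak * p_targ + (1 - polyak) * p
        p_targ.data.mul_(polyak)
        p_targ.data.add_((1 - polyak) * p.data)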
[0] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proceedings of the 35th International Conference on Machine Learning, 1861–1870. https://proceedings.mlr.press/v80/haarnoja18b.html
[3] Naik, A., Shariff, R., Yasui, N., Yao, H., & Sutton, R. S. (2019). Discounted Reinforcement Learning Is Not an Optimization Problem. ArXiv:1910.02140 [Cs]. http://arxiv.org/abs/1910.02140