REINFORCE algorithm for a continuous action space
I have recently started exploring and playing around with reinforcement learning. I have managed to wrap my head around discrete action spaces, and I have working implementations for a few OpenAI Gym environments using Q-learning and Expected SARSA. However, I am running into trouble understanding how to handle continuous action spaces.
From what I have understood so far, I have constructed a neural network that outputs the mean of a Gaussian distribution, with the standard deviation being fixed for now. Using the network's output as the mean, I sample an action from the Gaussian distribution and perform this action in the environment. For each step in an episode I save the starting state, the action and the reward. Once the episode is over I am supposed to train the network, but this is where I am struggling.
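For concreteness, this is roughly what my setup looks like so far; the layer sizes, the fixed standard deviation of 0.5 and the helper names are just illustrative choices on my part:

```python
import numpy as np
import tensorflow as tf

STD_DEV = 0.5  # fixed standard deviation of the Gaussian policy (illustrative)

def build_policy_network(state_dim, action_dim):
    # The network maps a state to the mean of a Gaussian over actions.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(action_dim, activation=None),  # outputs the mean
    ])

def sample_action(policy_net, state):
    # Forward pass gives the mean; the action is sampled from N(mean, STD_DEV^2).
    mean = policy_net(state[np.newaxis, :].astype(np.float32)).numpy()[0]
    return np.random.normal(loc=mean, scale=STD_DEV)
```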
From what I understand, the loss of the policy network is the log-probability of the chosen action multiplied by the discounted return from that step. For discrete actions this seems straightforward enough: use a softmax layer as the final layer and define a custom loss function in which the loss is the logarithm of the softmax output for the chosen action multiplied by the target value, which we set to the discounted return.
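In code, the discrete-case loss I have in mind looks roughly like the sketch below (the function name and the way the return is passed in are my own; the negation is there because Keras optimizers minimize, while we want to maximize expected return):

```python
import tensorflow as tf

def discrete_reinforce_loss(actions_onehot, returns, probs):
    # probs: softmax output of the policy network, shape (batch, num_actions)
    # actions_onehot: one-hot encoding of the actions that were actually taken
    # returns: discounted return G_t for each step, shape (batch,)
    log_prob = tf.math.log(tf.reduce_sum(probs * actions_onehot, axis=1) + 1e-8)
    # Negated so that minimising the loss performs gradient ascent on expected return.
    return -tf.reduce_mean(log_prob * returns)
```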
But how do you do this for a continuous action? The neural network outputs the mean, not the probability of an action or even the action itself, so how do I define a loss function to pass to Keras to perform the learning step in TensorFlow for the continuous case?
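My guess is that the continuous analogue should use the log-density of the Gaussian evaluated at the sampled action, something like the sketch below (again with my fixed sigma), but I cannot see how to wire this up as a custom Keras loss where the discounted return plays the role of the target:

```python
import math
import tensorflow as tf

STD_DEV = 0.5  # same fixed standard deviation as above

def gaussian_log_prob(actions, means):
    # Log-density of the taken actions under N(mean, STD_DEV^2),
    # summed over the action dimensions:
    # log p(a) = -(a - mean)^2 / (2 * sigma^2) - log(sigma * sqrt(2 * pi))
    log_prob = (-0.5 * tf.square((actions - means) / STD_DEV)
                - math.log(STD_DEV * math.sqrt(2.0 * math.pi)))
    return tf.reduce_sum(log_prob, axis=-1)
```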
I have read through a variety of articles on policy optimization, and while an article might mention the continuous case, the associated code always focuses on the discrete action space case, which is starting to become fairly disheartening. Can someone help me understand how to implement the continuous case in TensorFlow 2.0?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
