Why is Softmax applied at axis=0 for Policy Gradient methods (REINFORCE algorithm)?
I am going through the fourth chapter of the book Deep Reinforcement Learning in Action (Manning Publications). This chapter explains the code that applies reinforcement learning to the CartPole game. There are only 2 possible actions - left and right. The model accepts the state encoded as a vector and outputs probabilities corresponding to the 2 actions. The model (developed in PyTorch) has been defined as -
import torch

l1 = 4    # length of the CartPole state vector
l2 = 150  # hidden layer width
l3 = 2    # number of actions (left, right)

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=0)
)
The Softmax layer at the end has been applied at axis=0. I can understand this as a means of creating a probability distribution over the 2 actions (left or right). So, when a single state is given to this model, it outputs a 2-dimensional vector of probabilities, for example [0.25, 0.75]. However, what is puzzling me is that when a batch of states is given to this model, the softmax (since it is applied on axis=0) is applied across the first actions of the batch and then across the second actions of the batch, rather than to each action pair of the batch. So the sum of the first actions of the batch is 1, and the sum of the second actions of the batch is 1. One sample of the model's output is shown below -
tensor([[0.0541, 0.0580],
[0.0542, 0.0556],
[0.0542, 0.0579],
[0.0555, 0.0592],
[0.0578, 0.0597],
[0.0556, 0.0590],
[0.0578, 0.0596],
[0.0603, 0.0598],
[0.0616, 0.0595],
[0.0602, 0.0596],
[0.0616, 0.0594],
[0.0600, 0.0594],
[0.0614, 0.0592],
[0.0622, 0.0585],
[0.0611, 0.0589],
[0.0617, 0.0582],
[0.0606, 0.0585]], grad_fn=<SoftmaxBackward>)
The sum of the first actions is 1, and the same is the case for the second actions. Why is the softmax applied like this in the Policy Gradient method?
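To make the difference concrete, here is a minimal sketch (the logit values below are made up, not from the book) comparing dim=0 and dim=1 on a small batch:

import torch

logits = torch.tensor([[1.0, 2.0],
                       [0.5, 0.5],
                       [3.0, 1.0]])   # hypothetical batch of 3 states, 2 actions each

p0 = torch.softmax(logits, dim=0)  # normalises each column, i.e. across the batch
p1 = torch.softmax(logits, dim=1)  # normalises each row, i.e. across the 2 actions

print(p0.sum(dim=0))  # tensor([1., 1.])      -> each action column sums to 1
print(p1.sum(dim=1))  # tensor([1., 1., 1.])  -> each state's action pair sums to 1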
Solution 1:[1]
You need to apply the softmax on the second dimension so that it calculates a probability distribution for every step. Just change the last layer to:
torch.nn.Softmax(dim=1)
In general, you will apply the softmax this way in most policy gradient settings: we always care about the probability distribution over the actions we can take in a given state.
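As a rough sketch of how the corrected per-state distribution is then consumed in REINFORCE (not code from the book; the state batch below is a made-up placeholder), each row of the model's output can be fed into a Categorical distribution to sample actions and collect log-probabilities for the policy loss:

import torch
from torch.distributions import Categorical

model = torch.nn.Sequential(
    torch.nn.Linear(4, 150),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(150, 2),
    torch.nn.Softmax(dim=1)         # per-state probabilities, each row sums to 1
)

states = torch.randn(5, 4)          # placeholder batch of 5 CartPole states
probs = model(states)               # shape (5, 2)
dist = Categorical(probs=probs)     # one distribution per state
actions = dist.sample()             # one sampled action index per state, shape (5,)
log_probs = dist.log_prob(actions)  # log pi(a|s), weighted by returns in the REINFORCE loss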
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ahmet hamza emra |
