
Confusion with continuous action space (DiagGaussian) #109

Closed
KeirSimmons opened this issue Aug 14, 2018 · 5 comments

Comments

@KeirSimmons

My action space is defined as follows: Box(np.array([0]), np.array([0.1]))

Using a Box automatically makes the agent model the action space with the DiagGaussian distribution. However, the actions sampled from this distribution do not lie within [0, 0.1]. Can you please explain how to interpret this, and how to map the sampled values onto valid actions?
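
For illustration, a toy reproduction of the mismatch (plain NumPy, names are illustrative): the Gaussian has unbounded support, so its samples cannot respect the Box limits by themselves.

```python
import numpy as np

low, high = np.array([0.0]), np.array([0.1])                # the Box bounds from above
raw_action = np.random.normal(loc=0.0, scale=1.0, size=1)   # what a DiagGaussian head samples
in_bounds = np.all((raw_action >= low) & (raw_action <= high))
print(raw_action, in_bounds)                                # usually out of bounds
```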

@timmeinhardt
Contributor

I am not very familiar with non-discrete action spaces, but in general your policy model is not aware of the range specified in the action space. In the case of a continuous action space, our policy model uses a Gaussian distribution to sample values as actions. As far as I know, the mapping to your specific range is then usually done in the step method of your environment. See, for example, the continuous mountain car environment, where the action value is clipped.
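
For reference, a minimal sketch of that pattern as a gym.ActionWrapper (the wrapper name is illustrative, not something from this repo):

```python
import gym
import numpy as np

class ClipAction(gym.ActionWrapper):
    """Clip the raw policy output to the Box bounds before the env sees it."""
    def action(self, action):
        return np.clip(action, self.action_space.low, self.action_space.high)

# env = ClipAction(gym.make("MountainCarContinuous-v0"))
```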

@KeirSimmons
Author

So I have two different environments, one with the action space as above, and another with [0, 1000] (both continuous). I have tried two different approaches:

  1. Take the action value as is, with no clipping or rescaling. This produces poor results (unsurprisingly), but the sampled values do seem to drift slowly towards the given range because the reward signal pushes them there. For the [0, 1000] space the values started out around [-1.0, 1.0], I assume, and gradually expanded to roughly [-10.0, 10.0] (the range is of course actually infinite; my point is that the variance increased).

  2. Rescale the action by adding 1 and multiplying by 500, mapping [-1.0, 1.0] to [0, 1000] (a general form of this mapping is sketched below). This seemed to do well during training, but at 'enjoy' time (now using the mode rather than a sample) the values output by the Gaussian were much larger in magnitude than during training, so the chosen action was always 1000 after rescaling.

So I'm curious how best to approach the clipping/rescaling. Since the distribution is Gaussian and (I assume) zero-centred, clipping away negative values already throws out half of the distribution, which will heavily bias the agent.
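
For reference, the affine mapping from approach 2 written for an arbitrary Box range (the function name is illustrative). Note that it only stays inside the range if the raw value is bounded first, e.g. with tanh as suggested further down:

```python
import numpy as np

def rescale_action(squashed, low, high):
    """Affinely map a value in [-1, 1] onto the Box range [low, high]."""
    return low + 0.5 * (squashed + 1.0) * (high - low)

# e.g. rescale_action(np.tanh(raw_value), 0.0, 1000.0) lands in [0, 1000];
# without the tanh the raw Gaussian output can overshoot, which matches the
# always-1000 behaviour described in approach 2.
```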

@rwightman
Contributor

I'm not sure if you've noticed, but there is a similar and more extensive conversation on this topic (clip/scale in the environment vs. handle it in the model) at openai/baselines#121.

@ikostrikov
Owner

I would say the right way to handle this is to apply tanh and handle probabilities properly:

See Appendix C:
https://arxiv.org/pdf/1801.01290.pdf
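
For concreteness, a minimal sketch of that tanh squashing with the corrected log-probability (the paper is Soft Actor-Critic, Haarnoja et al. 2018; the function and variable names here are illustrative, not code from this repo):

```python
import torch

def tanh_squash(normal, raw, eps=1e-6):
    """Squash a Gaussian sample with tanh and apply the change-of-variables
    correction from Appendix C: log pi(a) = log mu(u) - sum_i log(1 - tanh(u_i)^2)."""
    action = torch.tanh(raw)                                   # action now lies in (-1, 1)
    log_prob = normal.log_prob(raw).sum(-1)                    # log-prob of the pre-squash sample
    log_prob -= torch.log(1.0 - action.pow(2) + eps).sum(-1)   # Jacobian correction
    return action, log_prob

# usage sketch:
# normal = torch.distributions.Normal(mean, std)
# raw = normal.rsample()
# action, log_prob = tanh_squash(normal, raw)
# env_action = low + 0.5 * (action + 1.0) * (high - low)      # then rescale onto the Box bounds
```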

@wranai

wranai commented Aug 31, 2018

Tanh sounds good, but I really like what somebody mentioned in the thread @rwightman referenced: use the Beta distribution, i.e. return not a mean/stddev pair for a normal distribution but the two parameters of a Beta distribution, and then transform the 0..1 range onto the Box limits. I guess the best way would be to have the network output the parameters in the -inf..inf range and then apply softplus to constrain them to be positive. 1 is a special value for the Beta parameters, which is why I would go with softplus rather than exp; but I may be wrong.
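
A rough sketch of that idea (class and parameter names are illustrative; the +1 offset is one common way to keep the Beta unimodal and is an assumption here, not necessarily what was meant above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaPolicyHead(nn.Module):
    """Illustrative Beta-distribution head: softplus keeps alpha/beta positive,
    the +1 offset keeps them >= 1, and the (0, 1) sample is rescaled onto the
    Box limits [low, high]."""
    def __init__(self, num_inputs, num_outputs, low, high):
        super().__init__()
        self.fc = nn.Linear(num_inputs, 2 * num_outputs)
        self.low, self.high = low, high

    def forward(self, x):
        alpha, beta = self.fc(x).chunk(2, dim=-1)
        dist = torch.distributions.Beta(F.softplus(alpha) + 1.0, F.softplus(beta) + 1.0)
        sample = dist.rsample()                              # in (0, 1)
        action = self.low + sample * (self.high - self.low)  # mapped onto the Box range
        return action, dist.log_prob(sample).sum(-1)
```

(The affine rescale only adds a constant log-Jacobian term, so it does not affect the policy gradient.)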

idobrusin added a commit to lukashermann/pytorch-a2c-ppo-acktr that referenced this issue Oct 14, 2020