
Is the range of output of PPO controlled by the action_space setting? #710

Closed
xubo92 opened this issue Jun 1, 2019 · 2 comments


xubo92 commented Jun 1, 2019

Hi @kashif @jbn @bichengcao @zhongwen @ViktorM,

I have a question about the action_space setting in the Env class.

If I set the low and high limits of action_space (say, [5, 10]), is the range of the actions output by the PPO algorithm controlled by my action_space setting? Since the output of PPO includes the mean of a Gaussian distribution, is that mean always zero, or does it change with the action_space range?
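For concreteness, a bounded action space like that could be declared as follows (a minimal sketch using gym.spaces.Box purely for illustration; the shape and dtype are assumptions, not my actual code):

```python
import numpy as np
from gym import spaces

# Illustrative only: a 1-D continuous action space bounded to [5, 10].
action_space = spaces.Box(low=5.0, high=10.0, shape=(1,), dtype=np.float32)
```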

Thanks!

@ryanjulian
Member

This is a question about (deep) RL fundamentals and implementation, and not really about garage. You may find my explanation unsatisfying, because our time is quite limited and this is not really the proper forum for these types of questions.

  • The mean of the Gaussian policy is not always zero. That would not be a very useful policy (unless the optimal action was always 0).
  • The action_space range does not directly control the mean of the policy, or even its output range. One could imagine clipping the output of the policy to action_space, but this is actually not typical in RL implementations. It is the role of the environment to enforce its action space.
  • The mean of the policy is a learned quantity which, in the case of GaussianMLP, is the output of an MLP model that takes the state as input. The mean of the policy when fed a state is the agent's hypothesis of the optimal action for that state (see the sketch after this list).
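To make the last point concrete, here is a minimal numpy sketch of a Gaussian MLP policy. This is not garage's GaussianMLP implementation; the weight names, layer sizes, and the state-independent log_std are placeholder assumptions.

```python
import numpy as np

# Minimal sketch of a Gaussian MLP policy (not garage's implementation).
# W1, b1, W2, b2 and log_std stand in for learned parameters.
def gaussian_mlp_policy(state, W1, b1, W2, b2, log_std, rng):
    hidden = np.tanh(state @ W1 + b1)   # hidden layer over the state
    mean = hidden @ W2 + b2             # the mean is a learned function of the state
    std = np.exp(log_std)               # learned, state-independent standard deviation
    return rng.normal(mean, std)        # sampled action: not automatically bounded

# Example call with random placeholder weights: the sample can land anywhere
# on the real line; it is the environment's job to enforce its action_space.
rng = np.random.default_rng(0)
state = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
action = gaussian_mlp_policy(state, W1, b1, W2, b2, log_std=np.zeros(1), rng=rng)
```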

Please take a look at the code if you want an in-depth understanding. You might also find the slides for this course helpful. Lecture 5 gives a nice overview of policy gradients.

I am going to close this question as off-topic. The scope of this project is creating and sustaining great implementations of deep RL algorithms, not teaching people deep RL fundamentals. You are welcome to post questions about how to achieve things with the software, enhance it, or fix bugs, but please refrain from using this issue tracker as a resource for learning about deep RL itself. Unfortunately, we can't be all things to all people and our resources are limited.

@xubo92
Author

xubo92 commented Jun 1, 2019

@ryanjulian
Thanks a lot for your reply.

Though I think the issue should not be closed so early: what I want to know is whether "garage" has considered the possible inconsistency between the env's action range and the model's Gaussian output. As far as I know, OpenAI Baselines did not handle this well for most policy gradient algorithms using a Gaussian distribution until last year. You can see the discussion on this issue.

Sometimes, when the output of the model exceeds the hard limit of the env's action space, the output is meaningless in the env. If we simply clip it into some range, that does not help the model learn how to generate a reasonable action within a reasonable range. This is why I asked the question.
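To illustrate the difference, here is a minimal sketch comparing hard clipping with tanh squashing, which maps the raw Gaussian output smoothly into the action range (the [5, 10] bounds are just the example from my first comment; this is not a claim about what garage or Baselines actually do):

```python
import numpy as np

low, high = 5.0, 10.0  # the [5, 10] bounds from the example above

def clip_action(raw_action):
    # Hard clipping: out-of-range outputs are silently truncated, so the
    # policy gets no signal about how far outside the range it was.
    return np.clip(raw_action, low, high)

def squash_action(raw_action):
    # Tanh squashing (used by some implementations, e.g. SAC-style code):
    # smoothly maps any real-valued output into (low, high), so bounded
    # actions can be learned end to end.
    return low + (high - low) * (np.tanh(raw_action) + 1.0) / 2.0
```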

Thanks!
