
Is the range of output of PPO controlled by the action_space setting? #710

Closed
xubo92 opened this issue Jun 1, 2019 · 2 comments


xubo92 commented Jun 1, 2019

Hi @kashif @jbn @bichengcao @zhongwen @ViktorM,

I have a question about the action_space setting in the Env class.

If I set the low and high limits of action_space (say, [5, 10]), is the range of the actions output by the PPO algorithm controlled by my action_space setting? Since the output of PPO includes the mean of a Gaussian distribution, is that mean always zero, or does it change with the action_space range?
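For concreteness, a bounded action space like that could be declared as follows (a minimal sketch using gym.spaces.Box purely for illustration; the shape and dtype are assumptions, not my actual code):

```python
import numpy as np
from gym import spaces

# Illustrative only: a 1-D continuous action space bounded to [5, 10].
action_space = spaces.Box(low=5.0, high=10.0, shape=(1,), dtype=np.float32)
```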

Thanks!

@ryanjulian
Member

This is a question about (deep) RL fundamentals and implementation, and not really about garage. You may find my explanation unsatisfying, because our time is quite limited and this is not really the proper forum for these types of questions.

  • The mean of the Gaussian policy is not always zero. That would not be a very useful policy (unless the optimal action was always 0).
  • The action_space range does not directly control the mean of the policy, or even its output range. One could imagine clipping the output of the policy to action_space, but this is actually not typical in RL implementations. It is the role of the environment to enforce its action space.
  • The mean of the policy is a learned quantity which, in the case of GaussianMLP, is the output of an MLP model that takes the state as input. The mean of the policy when fed a state is the agent's hypothesis of the optimal action for that state (see the sketch after this list).
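To make the last point concrete, here is a minimal numpy sketch of a Gaussian MLP policy. This is not garage's GaussianMLP implementation; the weight names, layer sizes, and the state-independent log_std are placeholder assumptions.

```python
import numpy as np

# Minimal sketch of a Gaussian MLP policy (not garage's implementation).
# W1, b1, W2, b2 and log_std stand in for learned parameters.
def gaussian_mlp_policy(state, W1, b1, W2, b2, log_std, rng):
    hidden = np.tanh(state @ W1 + b1)   # hidden layer over the state
    mean = hidden @ W2 + b2             # the mean is a learned function of the state
    std = np.exp(log_std)               # learned, state-independent standard deviation
    return rng.normal(mean, std)        # sampled action: not automatically bounded

# Example call with random placeholder weights: the sample can land anywhere
# on the real line; it is the environment's job to enforce its action_space.
rng = np.random.default_rng(0)
state = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
action = gaussian_mlp_policy(state, W1, b1, W2, b2, log_std=np.zeros(1), rng=rng)
```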

Please take a look at the code if you want an in-depth understanding. You might also find the slides for this course helpful. Lecture 5 gives a nice overview of policy gradients.

I am going to close this question as off-topic. The scope of this project is creating and sustaining great implementations of deep RL algorithms, not teaching people deep RL fundamentals. You are welcome to post questions about how to achieve things with the software, enhance it, or fix bugs, but please refrain from using this issue tracker as a resource for learning about deep RL itself. Unfortunately, we can't be all things to all people and our resources are limited.

@xubo92
Author

xubo92 commented Jun 1, 2019

@ryanjulian
Thanks a lot for your reply.

Though I think the issue should not be closed so early: what I want to know is whether "garage" has considered the possible inconsistency between the env's action range and the model's Gaussian output. As far as I know, OpenAI Baselines did not handle this well for most policy gradient algorithms using a Gaussian distribution until last year. You can see the discussion on this issue.

Sometimes, when the output of the model exceeds the hard limit of the env's action space, the output is meaningless in the env. If we simply clip it into some range, that does not help the model learn how to generate a reasonable action within a reasonable range. This is why I asked the question.
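To illustrate the difference, here is a minimal sketch comparing hard clipping with tanh squashing, which maps the raw Gaussian output smoothly into the action range (the [5, 10] bounds are just the example from my first comment; this is not a claim about what garage or Baselines actually do):

```python
import numpy as np

low, high = 5.0, 10.0  # the [5, 10] bounds from the example above

def clip_action(raw_action):
    # Hard clipping: out-of-range outputs are silently truncated, so the
    # policy gets no signal about how far outside the range it was.
    return np.clip(raw_action, low, high)

def squash_action(raw_action):
    # Tanh squashing (used by some implementations, e.g. SAC-style code):
    # smoothly maps any real-valued output into (low, high), so bounded
    # actions can be learned end to end.
    return low + (high - low) * (np.tanh(raw_action) + 1.0) / 2.0
```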

Thanks!
