Taking action limits into account in PPO/TRPO/ACKTR. #121
UPDATE: Actually, not all of the environments do the clipping automatically. I just came across this example: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py#L238, so in this case the algorithms above crash.
This was very helpful for finding my issue. Thanks @hamzamerzic :)
@rayman813 I am glad I helped! 👍 Would you mind sharing more details of what the issue was and how you fixed it, in case others stumble upon the same problem?
What is the point of the action space having bounds if they are not respected? I have a custom environment where the action space is a Box(0.0, 1.0, n) space, and I want to use the PPO1/PPO2 algorithms. This issue causes my env to crash, as actions outside that range don't make sense. What is the actual output range of the actions in the PPO algorithms, and how should I rescale? E.g., is clipping better, or transforming with something like a sigmoid?
@aeoost the simplest way around this is probably to create a wrapper that clips actions to the valid range.
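A minimal sketch of such a wrapper, assuming only that the env exposes a gym-style `action_space` with `low`/`high` bounds (the class name and structure here are ours; with gym installed the same idea is a few-line `gym.ActionWrapper` subclass):

```python
import numpy as np

class ClipActionWrapper:
    """Clips agent actions into the env's Box bounds before they
    reach step(). Hypothetical sketch, not code from baselines."""

    def __init__(self, env):
        self.env = env
        self.action_space = env.action_space

    def step(self, action):
        low = self.action_space.low
        high = self.action_space.high
        # The env only ever sees in-range actions.
        return self.env.step(np.clip(action, low, high))
```

The agent still trains on its own (possibly out-of-range) outputs; the wrapper only protects the environment from crashing.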
I'm with @aeoost. I've run into this problem as well, with the OpenAI baselines and with other DRL implementations. In my opinion, it would make sense for the model to output actions in the correct range. The lower and upper bounds of the
Hi. My opinion is:
@olegklimov In the link I posted in the comment you can see that the environment actually crashes when the action is outside of the limits; it does not penalize the actions. This means that baselines are not fully compatible with gym, unless I missed something crucial here, like some wrapper that handles this automatically.
@olegklimov I still feel this issue should not be closed. There is clearly an inconsistency between baselines and gym. Should I open an issue with gym instead?
@hamzamerzic @olegklimov Fully agree that this issue should not be closed. I think the fix belongs in baselines. It makes perfect sense for environments to have limits on the values continuous actions can take, and it should not be difficult to make a model that properly takes those into account. People shouldn't have to rely on hacks to get around this issue. Please fix the problem.
I agree that agents should respect action ranges, and that baselines does a bad job of this right now. We not only don't clip actions, we also don't stretch the policy outputs to the correct range. For example, if an action space has a large range like [-100, 100], our agents would still start by outputting values with stddev 1. The question isn't really whether this is a problem; it's how to best fix the problem. I proposed doing the fix in a wrapper (which could be placed in

@maximecb what do you have in mind when you say the model should output values in the correct range? If you parameterize a policy as a Gaussian, it must be able to output values in [-inf, inf]. If you clip the outputs coming out of the policy and use these clipped values in a policy gradient update, the resulting gradient will be biased (since the log probs won't reflect the clipped pdf). No matter how you slice it, the policy must believe that it can take any action value it wants to; otherwise the policy gradient is wrong. There are plenty of implementation-specific points to insert clipping (e.g. the argument to step() in PPO1). However, using a wrapper will not require changing every single implementation, while pretty much any other approach will.

As a side note, the Gaussian distribution is probably not ideal for these kinds of problems anyway. See, for example, this paper on using the Beta distribution in RL. The Beta distribution is bounded between 0 and 1, making it more appropriate for problems where the action space is constrained to an interval.
That sounds like the right approach to me. If the beta distribution is bounded between 0 and 1, it will be easy to translate and scale that range appropriately. With a Gaussian distribution, which has an infinite range, you can only have hacky fixes, and it will be hard to learn some action ranges.
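The translate-and-scale step is a one-liner. An illustrative sketch (the function name is ours, not from any library): sample x ~ Beta(α, β) on [0, 1], then map it linearly onto the env's [low, high] interval.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta_action(alpha, beta, low, high):
    """Sample from Beta(alpha, beta) on [0, 1], then translate and
    scale to the env's [low, high] range. The sample is guaranteed
    in-range by construction; no clipping needed."""
    x = rng.beta(alpha, beta)
    return low + (high - low) * x
```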
Right, let's implement a wrapper, because it is the correct thing to do. LunarLander specifically I'd change to -Inf..Inf actions; it already penalizes fuel usage. I don't think the action range supplied by the env really has any meaning actionable by the agent; -Inf..Inf is a good example. We don't have a mechanism to "recommend" a range to the agent (0..1 vs -1..+1, for example).
Could someone please point me to an implementation where this issue is handled?
IMO the paper pointed to by @unixpickle on the beta distribution for continuous RL is the best starting point. That author may be willing to share his implementation (if it isn't already on GitHub).
At the moment, even a hacky implementation with clipping would do. I am not sure how to convert the infinite range to a finite range with clipping. Is this the only change I need to make: clip the action space as below, and assume that the algorithm eventually figures out to output actions in the right range?
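For illustration, the clip can also live at the call site where the environment is stepped (e.g. around the argument to `step()` in PPO1's rollout loop, as mentioned earlier in the thread). This is a hypothetical helper, not a function from baselines; note the training batch can still store the raw action, which is exactly the bias concern raised above:

```python
import numpy as np

def clipped_step(env, raw_action):
    """Clip the policy's raw output to the action-space bounds just
    before stepping the env. The env only sees in-range actions;
    raw_action itself is left untouched for the training batch."""
    low, high = env.action_space.low, env.action_space.high
    return env.step(np.clip(raw_action, low, high))
```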
@olegklimov It's still important to respect the action ranges supplied by the envs, even though they may be arbitrary as in the LunarLander case. The fact that LunarLander's internal reward signal includes a penalty for actions makes its action range particularly arbitrary/unnecessary; however, simply changing it to (-inf, inf) is problematic because 1) it assumes we have domain knowledge about the environment and 2) you can imagine it could also change the optimal policy. An action space range is simply a constraint of the problem, so the solution can't simply be to change it. Besides, real applications have continuous action spaces that are bounded, so we need algorithms that can deal with them and benchmark environments that respect them. A wrapper is a good place to start, though it should be recognized that it's a hack, since as @unixpickle pointed out it will bias the policy gradient. Though I suppose if you had a black-box environment that clipped your actions, you'd never know...
This is precisely why implementing a Beta policy is the only non-hack solution that makes sense to me. You can do this "recommending" in a principled way with any distribution whose support is an interval (i.e. no infinity). Simply make
Is there currently any effort of implementing a beta policy as a baseline distribution? @brendenpetersen
@pmwenzel Not that I know of. I've started working on an implementation for an
@brendenpetersen Sure, that would be great.
@brendenpetersen Could you please share your implementation?
@pmwenzel @zishanahmed08 I implemented a beta policy; feel free to try it out from my fork.

Unfortunately, the baselines repository as a whole is not very modularized; for example, TRPO, PPO1, PPO2, and ACKTR all have their own policy implementations (with the lone exception of TRPO sharing PPO1's MLP policy), often with identical portions of code. I'm extremely uncomfortable with that; however, I also doubt they'd fold in a bunch of structural changes to their code if I did the modularization myself. So, for now I implemented the beta MLP policy as part of PPO2. It should be straightforward if not trivial to adapt to some of the other policy gradient algorithms.

Lastly, I included one hack, because sampling actions from the beta policy sometimes returned values of

I'd like to reference Po-Wei Chou's thesis on the beta policy, on which I based the policy and which had some useful ideas like using a softplus activation for the beta shape parameters.
Potentially relevant: https://arxiv.org/abs/1802.07564
@brendenpetersen did you run any benchmarks on beta vs. non-beta (ideally Hopper & Walker2d)? Unity-ML has a beta implementation in progress, and when I tested it on my UnityMojoco implementations it performed less well than vanilla PPO: Unity-Technologies/ml-agents#581
@Sohojoe No, I only tested on LunarLander. I don't have a MuJoCo license. Beta-PPO didn't really perform better than Gaussian-PPO even on LunarLander; however, it's not really the fairest comparison, because the hyperparameters were originally tuned using the Gaussian policy. For all we know, Beta could severely outperform Gaussian if its hyperparameters were independently tuned.
I stumbled across the exact same problem, training LunarLanderContinuous-v2 with the ppo1 baseline. @zishanahmed08 As you suggested, I added a single line
Still having issues with
Use clip (either modify LunarLander or your code). It is tested to work. The problem is not correctness; the problem is the lack of gradient when the action is clipped. But it is not a problem in this case, because fuel usage is punished in LunarLander, so it's not beneficial to sit at the limit for a long time.
Hi all, this is the workaround/hack which I've come up with in order to respect the environment's (possibly asymmetric) action bounds. In baselines/ddpg/training.py I've added a scaling of the actions before they are executed.
This is just a simple linear scaling from the [-1, 1] range of the DDPG algorithm to the action range provided by the environment (e.g. [-3, 22.5]). It works for multiple action dimensions as well.
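That linear scaling can be sketched as follows (the function name is ours; the commenter's actual patch to baselines/ddpg/training.py may differ in detail):

```python
import numpy as np

def rescale_action(a, low, high):
    """Linearly map an action in [-1, 1] (e.g. DDPG's tanh output)
    to the env's [low, high] range. Handles asymmetric bounds like
    [-3, 22.5] and works elementwise for multiple action dims."""
    return low + (high - low) * (a + 1.0) / 2.0
```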
Hi @brendenpetersen, I can't seem to find your fork with the beta distribution implementation. I'm having performance issues with clipping the bounds and was hoping to try your approach.

Update: I found that Tensorforce has implemented a beta distribution: https://github.com/reinforceio/tensorforce/blob/master/tensorforce/core/distributions/beta.py
I would expect the env to be robust and not crash in the case of an oob action. E.g. pressing the up button harder has the same effect as pressing the up button; it's just not as efficient... It's up to the author of the env to decide whether to punish oob actions, or handle/clip them. For the author of the agent, using the action limits would accelerate training, but it's not a requirement. The agent will [eventually] learn that -5:5 has the same effect as -1:1, and will ignore [-5..-2]:[2..5] as having 'no benefit'. However, the agent will struggle to learn this if the env crashes.
@ghost Did you solve your problem? I have the same issue: my action boundaries are between -500 and 500, and the actions from the network vary between about -3 and 4. Is there anybody using a MuJoCo environment with a large range of torque values? I clipped my actions and it is not enough. What is the output action range for PPO? Is there any?
Hey. I have a similar problem. I'm building a custom environment to solve a research problem. I'm using the observation space as a way to track the status of my agent in the environment. |
Hi @fbbfnc, what I have done to solve the problem is add an action_modifier() function to my env.py file. It takes the action from the network and adjusts it by multiplying with numbers suitable for my environment; after the multiplication, I clip the values according to my boundaries, and that worked for me. My agent is learning with TRPO.
Does action range clipping on the environment side really work well?
I really appreciate your work, but where can I find your implementation of the MLP beta policy? I checked your fork "stable-baselines" but did not find it in the PPO2 or common folders.
Hi @brendenpetersen |
This is more of a question than an issue. I noticed that in the implementation of the above-mentioned algorithms, action limits are not taken into account. Environments handle this clipping internally, so no errors will appear, but this brings us to situations where the algorithm's training batch has inputs that were not necessarily applied within the environment.
For example, let's say the upper limit for an input is 1 and the applied input (given by the algorithm) is 5. What the environment will experience is input 1, due to its internal clipping, but the algorithm's training batch will have data with the action equal to 5.
Intuitively it makes sense that the algorithm will learn how to deal with this, but I am wondering if using the information of exactly what action is applied would be beneficial. Additionally, we can think of applying the action clipping even before adding noise (since noise doesn't really do anything if the mean is already out of limits). For example, DDPG handles this nicely with tanh outputs before applying the noise, and with clipping applied afterwards.
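The DDPG pattern described above (tanh-bounded output, then exploration noise, then a final clip) can be sketched as follows. This is our own illustration under those assumptions, not code from baselines:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpg_style_action(raw_mean, noise_std, low, high):
    """Bound the policy output with tanh first, add exploration noise
    afterwards, rescale (-1, 1) to the env's [low, high] range, and
    clip only the noise overshoot. Illustrative sketch of the DDPG
    pattern described above."""
    bounded = np.tanh(raw_mean)  # already in (-1, 1)
    noisy = bounded + rng.normal(0.0, noise_std, size=np.shape(raw_mean))
    scaled = low + (high - low) * (noisy + 1.0) / 2.0
    return np.clip(scaled, low, high)
```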