Taking action limits into account in PPO/TRPO/ACKTR. #121
UPDATE: Actually, not all of the environments do the clipping automatically. I just came across this example: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py#L238, so in this case the algorithms above crash.
This was very helpful for finding my issue. Thanks @hamzamerzic :)
@rayman813 I am glad I helped! 👍 Would you mind sharing more details of what the issue was and how you fixed it, in case others stumble upon the same problem?
What is the point of the action space having bounds if they are not respected? I have a custom environment where the action space is a Box(0.0, 1.0, n) space, and I want to use the PPO1/PPO2 algorithms. This issue causes my env to crash, as actions outside that range don't make sense. What is the actual output range of the actions in the PPO algorithms, and how should I rescale? E.g., is clipping better, or transforming with something like a sigmoid?
@aeoost the simplest way around this is probably to create a wrapper that clips actions to the valid range.
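A minimal sketch of such a wrapper, assuming only that the env exposes a gym-style `action_space` with `low`/`high` bounds (the class name and structure here are ours; with gym installed the same idea is a few-line `gym.ActionWrapper` subclass):

```python
import numpy as np

class ClipActionWrapper:
    """Clips agent actions into the env's Box bounds before they
    reach step(). Hypothetical sketch, not code from baselines."""

    def __init__(self, env):
        self.env = env
        self.action_space = env.action_space

    def step(self, action):
        low = self.action_space.low
        high = self.action_space.high
        # The env only ever sees in-range actions.
        return self.env.step(np.clip(action, low, high))
```

The agent still trains on its own (possibly out-of-range) outputs; the wrapper only protects the environment from crashing.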
I'm with @aeoost. I've run into this problem as well, with the OpenAI baselines and with other DRL implementations. In my opinion, it would make sense for the model to output actions in the correct range. The lower and upper bounds of the
Hi. My opinion is:
@olegklimov In the link I posted in the comment you can see that the environment actually crashes when the action is outside of the limits; it does not penalize the actions. This means that baselines are not fully compatible with gym, unless I missed something crucial here, like some wrapper that handles this automatically.
@olegklimov I still feel this issue should not be closed. There is clearly an inconsistency between baselines and gym. Should I open an issue with gym instead?
@hamzamerzic @olegklimov Fully agree that this issue should not be closed. I think the fix belongs in baselines. It makes perfect sense for environments to have limits on the values continuous actions can take, and it should not be difficult to make a model that properly takes those into account. People shouldn't have to rely on hacks to get around this issue. Please fix the problem.
I agree that agents should respect action ranges, and that baselines does a bad job of this right now. We not only don't clip actions, we also don't stretch the policy outputs to the correct range. For example, if an action space has a large range like [-100, 100], our agents would still start by outputting values with stddev 1. The question isn't really whether this is a problem; it's how to best fix the problem. I proposed doing the fix in a wrapper (which could be placed in

@maximecb what do you have in mind when you say the model should output values in the correct range? If you parameterize a policy as a Gaussian, it must be able to output values in [-inf, inf]. If you clip the outputs coming out of the policy and use these clipped values in a policy gradient update, the resulting gradient will be biased (since the log probs won't reflect the clipped pdf). No matter how you slice it, the policy must believe that it can take any action value it wants to; otherwise the policy gradient is wrong. There are plenty of implementation-specific points to insert clipping (e.g. the argument to step() in PPO1). However, using a wrapper will not require changing every single implementation, while pretty much any other approach will.

As a side note, the Gaussian distribution is probably not ideal for these kinds of problems anyway. See, for example, this paper on using the Beta distribution in RL. The Beta distribution is bounded between 0 and 1, making it more appropriate for problems where the action space is constrained to an interval.
That sounds like the right approach to me. If the beta distribution is bounded between 0 and 1, it will be easy to translate and scale that range appropriately. With a Gaussian distribution, which has an infinite range, you can only have hacky fixes, and it will be hard to learn some action ranges.
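The translate-and-scale step is a one-liner. An illustrative sketch (the function name is ours, not from any library): sample x ~ Beta(α, β) on [0, 1], then map it linearly onto the env's [low, high] interval.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta_action(alpha, beta, low, high):
    """Sample from Beta(alpha, beta) on [0, 1], then translate and
    scale to the env's [low, high] range. The sample is guaranteed
    in-range by construction; no clipping needed."""
    x = rng.beta(alpha, beta)
    return low + (high - low) * x
```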
Right, let's implement a wrapper, because it is the correct thing to do. LunarLander specifically I'd change to -Inf..Inf actions; it already penalizes fuel usage. I don't think the action range supplied by the env really has any meaning actionable by the agent; -Inf..Inf is a good example. We don't have a mechanism to "recommend" a range to the agent (0..1 vs -1..+1, for example).
Could someone please point me to an implementation where this issue is handled?
IMO the paper pointed to by @unixpickle on the beta distribution for continuous RL is the best starting point. That author may be willing to share his implementation (if it isn't already on GitHub).
At the moment, even a hacky implementation with clipping would do. I am not sure how to convert the infinite range to a finite range with clipping. Is this the only change I need to make: clip the action space as below, and assume that the algorithm eventually figures out to output actions in the right range?
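For illustration, the clip can also live at the call site where the environment is stepped (e.g. around the argument to `step()` in PPO1's rollout loop, as mentioned earlier in the thread). This is a hypothetical helper, not a function from baselines; note the training batch can still store the raw action, which is exactly the bias concern raised above:

```python
import numpy as np

def clipped_step(env, raw_action):
    """Clip the policy's raw output to the action-space bounds just
    before stepping the env. The env only sees in-range actions;
    raw_action itself is left untouched for the training batch."""
    low, high = env.action_space.low, env.action_space.high
    return env.step(np.clip(raw_action, low, high))
```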
@olegklimov It's still important to respect the action ranges supplied by the envs, even though they may be arbitrary as in the LunarLander case. The fact that LunarLander's internal reward signal includes a penalty for actions makes its action range particularly arbitrary/unnecessary; however, simply changing it to (-inf, inf) is problematic because 1) it assumes we have domain knowledge about the environment and 2) you can imagine it could also change the optimal policy. An action space range is simply a constraint of the problem, so the solution can't simply be to change it. Besides, real applications have continuous action spaces that are bounded, so we need algorithms that can deal with them and benchmark environments that respect them. A wrapper is a good place to start, though it should be recognized that it's a hack, since as @unixpickle pointed out it will bias the policy gradient. Though I suppose if you had a black-box environment that clipped your actions, you'd never know...
This is precisely why implementing a Beta policy is the only non-hack solution that makes sense to me. You can do this "recommending" in a principled way with any distribution whose support is an interval (i.e. no infinity). Simply make
Is there currently any effort of implementing a beta policy as a baseline distribution? @brendenpetersen
@pmwenzel Not that I know of. I've started working on an implementation for an
@brendenpetersen Sure, that would be great.
@brendenpetersen Could you please share your implementation?
@pmwenzel @zishanahmed08 I implemented a beta policy; feel free to try it out from my fork.

Unfortunately, the baselines repository as a whole is not very modularized; for example, TRPO, PPO1, PPO2, and ACKTR all have their own policy implementations (with the lone exception of TRPO sharing PPO1's MLP policy), often with identical portions of code. I'm extremely uncomfortable with that; however, I also doubt they'd fold in a bunch of structural changes to their code if I did the modularization myself. So, for now I implemented the beta MLP policy as part of PPO2. It should be straightforward if not trivial to adapt to some of the other policy gradient algorithms.

Lastly, I included one hack, because sampling actions from the beta policy sometimes returned values of

I'd like to reference Po-Wei Chou's thesis on the beta policy, on which I based the policy and which had some useful ideas like using a softplus activation for the beta shape parameters.
Potentially relevant: https://arxiv.org/abs/1802.07564
@brendenpetersen did you run any benchmarks on beta vs. non-beta (ideally Hopper & Walker2d)? Unity-ML has a beta implementation in progress, and when I tested it on my UnityMojoco implementations it performed less well than vanilla PPO: Unity-Technologies/ml-agents#581
@Sohojoe No, I only tested on LunarLander. I don't have a MuJoCo license. Beta-PPO didn't really perform better than Gaussian-PPO even on LunarLander; however, it's not really the fairest comparison, because the hyperparameters were originally tuned using the Gaussian policy. For all we know, Beta could severely outperform Gaussian if its hyperparameters were independently tuned.
I stumbled across the exact same problem, training LunarLanderContinuous-v2 with the ppo1 baseline. @zishanahmed08 As you suggested, I added a single line
Still having issues with
Use clip (either modify LunarLander or your code). It is tested to work. The problem is not correctness; the problem is the lack of gradient when the action is clipped. But it is not a problem in this case, because fuel usage is punished in LunarLander, so it's not beneficial to sit at the limit for a long time.
Hi all, this is the workaround/hack which I've come up with in order to respect the environment's (possibly asymmetric) action bounds. In baselines/ddpg/training.py I've added a scaling of the actions before they are executed.
This is just a simple linear scaling from the [-1, 1] range of the DDPG algorithm to the action range provided by the environment (e.g. [-3, 22.5]). It works for multiple action dimensions as well.
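That linear scaling can be sketched as follows (the function name is ours; the commenter's actual patch to baselines/ddpg/training.py may differ in detail):

```python
import numpy as np

def rescale_action(a, low, high):
    """Linearly map an action in [-1, 1] (e.g. DDPG's tanh output)
    to the env's [low, high] range. Handles asymmetric bounds like
    [-3, 22.5] and works elementwise for multiple action dims."""
    return low + (high - low) * (a + 1.0) / 2.0
```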
Hi @brendenpetersen, I can't seem to find your fork with the beta distribution implementation. I'm having performance issues with clipping the bounds and was hoping to try your approach.

Update: I found that Tensorforce has implemented a beta distribution: https://github.com/reinforceio/tensorforce/blob/master/tensorforce/core/distributions/beta.py
I would expect the env to be robust and not crash in the case of an oob action. E.g. pressing the up button harder has the same effect as pressing the up button; it's just not as efficient... It's up to the author of the env to decide whether to punish oob actions, or handle/clip them. For the author of the agent, using the action limits would accelerate training, but it's not a requirement. The agent will [eventually] learn that -5:5 has the same effect as -1:1, and will ignore [-5..-2]:[2..5] as having 'no benefit'. However, the agent will struggle to learn this if the env crashes.
@ghost Did you solve your problem? I have the same issue: my action boundaries are between -500 and 500, and the actions from the network vary between about -3 and 4. Is there anybody using a MuJoCo environment with a large range of torque values? I clipped my actions and it is not enough. What is the output action range for PPO? Is there any?
Hey. I have a similar problem. I'm building a custom environment to solve a research problem. I'm using the observation space as a way to track the status of my agent in the environment. |
Hi @fbbfnc, what I have done to solve the problem is add an action_modifier() function to my env.py file. It takes the action from the network and adjusts it by multiplying with numbers suitable for my environment; after the multiplication, I clip the values according to my boundaries, and that worked for me. My agent is learning with TRPO.
Does action range clipping on the environment side really work well?
I really appreciate your work, but where can I find your implementation of the MLP beta policy? I checked your fork "stable-baselines" but did not find it in the PPO2 or common folders.
Hi @brendenpetersen |
This is more of a question than an issue. I noticed that in the implementation of the above-mentioned algorithms, action limits are not taken into account. Environments handle this clipping internally, so no errors will appear, but this brings us to situations where the algorithm's training batch has inputs that were not necessarily applied within the environment.
For example, let's say the upper limit for an input is 1 and the applied input (given by the algorithm) is 5. What the environment will experience is input 1, due to its internal clipping, but the algorithm's training batch will have data with the action equal to 5.
Intuitively it makes sense that the algorithm will learn how to deal with this, but I am wondering if using the information of exactly what action is applied would be beneficial. Additionally, we can think of applying the action clipping even before adding noise (since noise doesn't really do anything if the mean is already out of limits). For example, DDPG handles this nicely with tanh outputs before applying the noise, and with clipping applied afterwards.
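The DDPG pattern described above (tanh-bounded output, then exploration noise, then a final clip) can be sketched as follows. This is our own illustration under those assumptions, not code from baselines:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpg_style_action(raw_mean, noise_std, low, high):
    """Bound the policy output with tanh first, add exploration noise
    afterwards, rescale (-1, 1) to the env's [low, high] range, and
    clip only the noise overshoot. Illustrative sketch of the DDPG
    pattern described above."""
    bounded = np.tanh(raw_mean)  # already in (-1, 1)
    noisy = bounded + rng.normal(0.0, noise_std, size=np.shape(raw_mean))
    scaled = low + (high - low) * (noisy + 1.0) / 2.0
    return np.clip(scaled, low, high)
```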