
[Question] Justifying advantage normalization for PPO #485

Closed
2 tasks done
zhihanyang2022 opened this issue Jun 22, 2021 · 6 comments
Labels
question Further information is requested

Comments


zhihanyang2022 commented Jun 22, 2021

Question

For PPO, I understand that advantage normalization (for each batch of experiences) is more or less standard practice; I've seen other implementations do it too. However, I find it a little unjustified, and here's why.

If we are using GAE, then each advantage is a weighted sum of a whole bunch of TD deltas, r + gamma * V(s') - V(s). Suppose most of these deltas are positive (not an unreasonable assumption, especially when training is going well, i.e., when the action taken is increasingly better than the "average" action). Then advantages for earlier transitions would be higher than those for later transitions, simply because towards the end of an episode there are fewer TD deltas to sum.

In this case, normalizing the advantages (which involves subtracting the mean) would give early transitions positive advantages and later transitions negative advantages, which might hurt performance and doesn't make sense intuitively. Also, the gist of policy gradient algorithms is that we should encourage an action with a positive advantage whenever we can; arguments like "give the model something to encourage and something to discourage in every batch of updates" are not convincing enough.

Are there stronger justifications (e.g., papers) for why advantage normalization should be used by default in SB3? Has anyone investigated the practical differences?

A more sound alternative seems to be dividing by the max or std, without subtracting the mean.
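
To make the two variants concrete, here is a minimal NumPy sketch of the standard per-batch normalization and of the scale-only alternative (illustrative only, not the SB3 code; the batch values are made up):

```python
import numpy as np

def normalize_advantages(adv: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-batch normalization: subtract the mean, then divide by the std."""
    return (adv - adv.mean()) / (adv.std() + eps)

def scale_only(adv: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Alternative discussed above: rescale without re-centering, so signs are preserved."""
    return adv / (adv.std() + eps)

# A hypothetical batch where every advantage is positive.
adv = np.array([2.0, 1.5, 1.0, 0.5, 0.2])
print(normalize_advantages(adv))  # mean subtraction makes some entries negative
print(scale_only(adv))            # all entries keep their original (positive) sign
```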

Thanks!

Context

I've checked this issue, but it doesn't resolve my confusion (it's not even closed):

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
zhihanyang2022 added the question label on Jun 22, 2021

araffin commented Jun 22, 2021

Hello,
indeed, good question.

As an initial answer: we do it to match the OpenAI Baselines results. I think that in practice it should not change the results much (there was an empirical study on PPO, but I'm not sure whether they varied that parameter).

an additional link: hill-a/stable-baselines#638

zhihanyang2022 (Author) commented

I'm digging into this a bit, so let's keep this issue open and I will post what I find, for future reference.


araffin commented Jun 29, 2021

"Then advantages for earlier transitions would be higher than those for later transitions, simply because towards the end of an episode there are fewer TD deltas to sum."

I'm not sure I follow you on that point.

Each term that is summed should have more or less the same magnitude (after a quick test on CartPole, this seems to be the case), and you should see the discounted summation (the definition of GAE) as a weighted average that weights the closest deltas more than the more distant ones, so it should give you more or less the same magnitude at the end, no?
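
For reference, here is a minimal sketch of the GAE recursion being described, assuming a single rollout with a bootstrap value for the final state (this mirrors the standard definition rather than SB3's buffer code, and it ignores episode terminations):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages by summing discounted TD deltas backwards in time."""
    # TD deltas: r_t + gamma * V(s_{t+1}) - V(s_t), bootstrapping with last_value.
    deltas = rewards + gamma * np.append(values[1:], last_value) - values
    advantages = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Each advantage is the (gamma * lam)-discounted sum of the deltas that follow it, which is the weighting described above.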


o-Oscar commented Jul 8, 2021

I have had the same question for a few years now. I haven't found any answer.
After testing both options on a personal benchmark, I have found no sign that this normalization has any effect, either good or bad.

PS: Please note that although this "personal benchmark" was loosely based on stable-baselines 2, it was not well tested and might have had some silent bugs. Furthermore, I did not investigate this question scientifically: I just ran two training sessions, one with the normalization and one with that line commented out, compared the two runs, saw that the performances matched, and went on with my life.


Miffyli commented Jul 8, 2021

The "large scale emperical study" on on-policy algorithms says "advantage normalization does not seem to effect performance much" (Section 3.3), with varying results between envs, much like you said @o-Oscar.


zhihanyang2022 commented Jul 14, 2021

Sorry for the delay.

@araffin Yes, what I described indeed does not happen when you bootstrap correctly at the final step (I checked the code in stable-baselines3 again, and it does exactly this).

But the problem persists when people don't bootstrap at the final step (in continuous-control envs; in episodic envs, of course, no bootstrap is needed when the task ends gracefully). This happens when people use the one-sample return in place of the advantage. To my knowledge, this is how most people implement their first policy gradient project (with, e.g., CartPole), and it still works.

In response to this, maybe a plot would help, but I think it's quite self-evident. Let me know what you think!
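
To make that concrete anyway, here is a small numerical sketch of the no-bootstrap case, using a CartPole-style constant reward of 1 per step (hypothetical numbers, just for illustration):

```python
import numpy as np

# CartPole-style episode: reward of 1 at every step, no bootstrap at the end.
rewards = np.ones(10)
gamma = 0.99

# Reward-to-go used in place of the advantage (REINFORCE-style).
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

normalized = (returns - returns.mean()) / (returns.std() + 1e-8)
print(returns)     # monotonically decreasing: early steps sum more terms
print(normalized)  # early steps positive, late steps negative after mean subtraction
```

Here the return-to-go shrinks toward the end of the episode simply because fewer terms remain, so subtracting the batch mean makes the late-step "advantages" negative even though every action was rewarded.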

@Miffyli Regarding the empirical study you mentioned: I think it's great. Here's a more mathematical justification for the normalization of advantages (from the CS258 lecture 6 slide "Critics as state-dependent baselines", for those who are interested):

  • The slides show that subtracting an action-independent baseline from each advantage does not change the expectation of the policy gradient (and if the gradient is already biased due to bootstrapping, I don't think this makes it more biased); see the short derivation sketched after this list. This means that subtracting the mean of all the advantages (which is not a function of the action) is okay to do. In general, doing so with one-sample returns can reduce variance and improve performance. But since the advantages are already roughly zero-centered (because the estimated state value has already been subtracted), de-meaning again may not help as much.
  • Then all that's left is scaling down by the standard deviation, which doesn't change the sign of the advantages from the previous step. This is easier to understand: keeping the magnitude of the advantages roughly constant throughout training may make training more robust to hyperparameters like the learning rate.
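
For reference, here is the standard baseline identity those slides rely on, sketched in LaTeX (generic policy gradient math, nothing SB3-specific; b(s) is any action-independent baseline, and a constant such as the batch mean is a special case):

```latex
% The baseline term has zero expectation under the policy:
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right]
  = b(s) \int \nabla_\theta \pi_\theta(a \mid s)\, \mathrm{d}a
  = b(s)\, \nabla_\theta \!\int \pi_\theta(a \mid s)\, \mathrm{d}a
  = b(s)\, \nabla_\theta 1
  = 0 .

% Hence subtracting b(s) leaves the policy gradient unchanged in expectation:
\nabla_\theta J(\theta)
  = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s) \left( \hat{A}(s, a) - b(s) \right) \right]
  = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \hat{A}(s, a) \right] .
```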

Hope it helps!
