Understanding normalization of advantage function #544
I am wondering about the normalization of the advantage function in PPO. Before training on a batch, the mean of the advantage function is subtracted and it is divided by its std.

To me it makes intuitive sense to divide the advantage function by its std, since then we always have the same magnitude of gradients in each update. However, I totally don't understand why it would be beneficial to subtract the mean from the advantage function. In my understanding that would introduce some form of bias, since values that were greater than 0, and that should therefore be encouraged, can now potentially fall below 0 if they are smaller than the mean, so these actions are falsely trained to occur less often.

So what's the intuition behind subtracting the mean? And does it really improve learning?
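For reference, here is a minimal sketch of the normalization step being asked about, assuming the advantage estimates (e.g. from GAE) have already been computed; this is a paraphrase, not the repository's exact code:

```python
import numpy as np

def normalize_advantages(advs: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-minibatch normalization: subtract the batch mean and divide by the
    batch std. eps guards against division by zero on near-constant batches."""
    return (advs - advs.mean()) / (advs.std() + eps)
```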
If you sample enough data, it's all OK, but in practice we can't, and so it's important to subtract it.
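A toy check of the standard baseline argument this reply appeals to (my own sketch, not from the thread): subtracting a constant from the return leaves the REINFORCE gradient estimate unbiased, but on finite samples it can reduce the variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit with a sigmoid policy over a single logit theta.
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))    # probability of choosing arm 1
rewards = np.array([1.0, 3.0])      # deterministic rewards of arm 0 and arm 1

def grad_samples(n: int, baseline: float) -> np.ndarray:
    """Per-sample REINFORCE estimates of d/dtheta E[R]."""
    arm1 = rng.random(n) < p
    r = np.where(arm1, rewards[1], rewards[0])
    dlogp = np.where(arm1, 1.0 - p, -p)   # d log pi(a) / d theta
    return (r - baseline) * dlogp

no_base = grad_samples(100_000, baseline=0.0)
with_base = grad_samples(100_000, baseline=rewards.mean())

print(no_base.mean(), with_base.mean())  # both ~0.49: still unbiased
print(no_base.var(), with_base.var())    # ~0.84 vs ~0.006: far less variance
```

With only a few samples, the no-baseline estimate is much noisier, which is the "in practice we can't sample enough" point.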
That's still not clear to me. Independent of the dataset size or batch size used, the advantage function should give me positive advantages for things I should do more often and negative advantages for things I should do less often (that's why we subtract the baseline/value function), or am I wrong? Furthermore, subtracting the mean does change this behaviour, in my view.
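A concrete instance of the concern in this comment, using the `normalize_advantages` sketch above with made-up numbers:

```python
import numpy as np

advs = np.array([0.5, 1.0, 3.0])   # all positive: every action beat the value baseline
norm = (advs - advs.mean()) / (advs.std() + 1e-8)
print(norm)                        # ~[-0.93, -0.46, 1.39]: two actions now get pushed down
```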
@Marcel1991 I was also wondering about this. Did you gain any insight into the benefits of normalising / not normalising in PPO?
I still don't understand what the advantage of normalizing the advantages is.
https://arxiv.org/pdf/2006.05990.pdf concludes that "per-minibatch advantage normalization (C67) seems not to affect the performance too much (Fig. 35)".
Doesn't subtracting the mean from the advantages have the effect of an entropy regularizer? Ignoring the clipping, the objective is (writing $\mu$ and $\sigma$ for the batch mean and std of the advantages, and using $\mathbb{E}[\log \pi(a \mid s)] = -\mathcal{H}[\pi]$)

$$\mathbb{E}\!\left[ \frac{A - \mu}{\sigma} \log \pi(a \mid s) \right] = \frac{1}{\sigma}\, \mathbb{E}\!\left[ A \log \pi(a \mid s) \right] + \frac{\mu}{\sigma}\, \mathcal{H}[\pi].$$

If the batch size is small, so the mean and std fluctuate, then the entropy would be increased a bit more on some batches than on others, but it probably doesn't make a big difference.
@danijar That's actually an interesting perspective. But the mean can also be negative, right? If that's the case, the second term actually makes all actions more likely. So it's a bit unclear a priori what the second term is gonna do. |
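To make this exchange concrete, here is a small numeric check (my own sketch) of the decomposition above for a fixed categorical policy. The extra term is scaled by $\mu/\sigma$, so its sign flips with the sign of the batch mean, which is exactly the ambiguity raised in the last reply:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed categorical policy over 3 actions, with made-up per-action advantages.
probs = np.array([0.2, 0.3, 0.5])
adv = np.array([-1.0, 0.5, 2.0])          # subtract e.g. 3.0 to make the mean negative

a = rng.choice(3, size=200_000, p=probs)  # actions sampled from the policy
A, L = adv[a], np.log(probs)[a]
mu, sigma = A.mean(), A.std()

lhs = ((A - mu) / sigma * L).mean()                        # normalized-advantage objective
rhs = (A * L).mean() / sigma + (mu / sigma) * -L.mean()    # -L.mean() estimates entropy H
print(lhs, rhs)  # identical: mean subtraction adds (mu/sigma) * H to the objective
```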