Understanding normalization of advantage function #544

Open
mbcel opened this issue Aug 27, 2018 · 7 comments

@mbcel

mbcel commented Aug 27, 2018

I am wondering about the normalization of the advantage function in PPO. Before training on a batch, the mean of the advantages is subtracted and they are divided by their std.
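
For concreteness, here is a minimal sketch of the per-batch normalization I mean (the function name and numbers are just for illustration, not from any particular implementation):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    # Per-minibatch normalization: subtract the batch mean, divide by the batch std.
    return (adv - adv.mean()) / (adv.std() + eps)

# Made-up advantage estimates for one minibatch.
adv = np.array([0.5, -1.0, 2.0, 0.1])
print(normalize_advantages(adv))
```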

To me it intuitively makes sense to divide the advantages by their std, since then the gradients always have the same magnitude in each update. However, I don't understand at all why it would be beneficial to subtract the mean from the advantages. In my understanding that would introduce some form of bias: for example, advantages that were greater than 0, and whose actions should therefore be encouraged, can now fall below 0 if they are smaller than the mean, so those actions are falsely trained to be taken less often.

So what's the intuition behind subtracting the mean? And does it really improve learning?

@initial-h

If you sample enough data, it's all OK. But in practice we can't, so it's important to subtract it.

@mbcel

mbcel commented Sep 1, 2018

That's still not clear to me.

Independent of the dataset size or the batch size used, the advantage function should give me positive advantages for things I should do more often and negative advantages for things I should do less often (that's why we subtract the baseline/value function), or am I wrong? Subtracting the mean on top of that changes this behaviour, in my view. A concrete (made-up) example of the sign flip I mean is sketched below.
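
```python
import numpy as np

# Made-up advantages: every action did better than the baseline (all positive).
adv = np.array([0.1, 1.0, 3.0])

# After subtracting the batch mean, the smallest advantage becomes negative,
# so that action is pushed down even though its raw advantage was positive.
print(adv - adv.mean())  # approx. [-1.27 -0.37  1.63]
```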

@lancerane

@Marcel1991 I was also wondering about this. Did you gain any insight into the benefits of normalising vs. not normalising in PPO?

@shtse8

shtse8 commented Aug 27, 2020

I still don't understand what the advantage of normalizing the advantages is.

@ChenDRAG

https://arxiv.org/pdf/2006.05990.pdf concludes that "per-minibatch advantage normalization (C67) seems not to affect the performance too much (Fig. 35)"

@danijar

danijar commented Jan 15, 2022

Doesn't subtracting the mean from the advantages have the effect of an entropy regularizer?

Ignoring the clipping, the objective is `logp * (adv - mean) / std = logp * adv / std - logp * mean / std`. The first term is the normalized policy gradient. The second term makes all actions in the batch less likely, effectively acting as an entropy bonus on the policy.

If the batch size is small so the mean and std fluctuate, then the entropy would be increased a bit more on some batches than on others, but it probably doesn't make a big difference.
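
A tiny numeric check of that split, with made-up numbers (not taken from any particular implementation):

```python
import numpy as np

# Made-up log-probabilities and advantages for one minibatch.
logp = np.array([-0.7, -1.2, -0.3])
adv = np.array([0.5, -1.0, 2.0])
mean, std = adv.mean(), adv.std()

full = logp * (adv - mean) / std              # objective with normalized advantages
split = logp * adv / std - logp * mean / std  # policy-gradient term + mean term
print(np.allclose(full, split))               # True

# With a positive batch mean, maximizing the second term (-logp * mean / std)
# pushes the log-probabilities of all sampled actions down, like an entropy bonus;
# with a negative mean the sign flips and all actions are pushed up instead.
```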

@zhihanyang2022

zhihanyang2022 commented Jan 15, 2022

@danijar That's actually an interesting perspective. But the mean can also be negative, right? If that's the case, the second term actually makes all actions more likely. So it's a bit unclear a priori what the second term is gonna do.
