Understanding normalization of advantage function #544
I am wondering about the normalization of the advantage function in PPO. Before training on a batch, the mean of the advantage function is subtracted and it is divided by its std.

To me it makes intuitive sense to divide the advantage function by its std, since then we always have the same magnitude of gradients in each update. However, I totally don't understand why it would be beneficial to subtract the mean from the advantage function. In my understanding that would introduce some form of bias, since values that were greater than 0, and that should therefore be encouraged, can now potentially fall below 0 if they are smaller than the mean, so these actions are falsely trained to occur less often.

So what's the intuition behind subtracting the mean? And does it really improve learning?
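For reference, here is a minimal sketch of the normalization step being asked about, assuming the advantage estimates (e.g. from GAE) have already been computed; this is a paraphrase, not the repository's exact code:

```python
import numpy as np

def normalize_advantages(advs: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-minibatch normalization: subtract the batch mean and divide by the
    batch std. eps guards against division by zero on near-constant batches."""
    return (advs - advs.mean()) / (advs.std() + eps)
```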
If you sample enough data, it's all OK, but in practice we can't, and so it's important to subtract it.
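A toy check of the standard baseline argument this reply appeals to (my own sketch, not from the thread): subtracting a constant from the return leaves the REINFORCE gradient estimate unbiased, but on finite samples it can reduce the variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit with a sigmoid policy over a single logit theta.
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))    # probability of choosing arm 1
rewards = np.array([1.0, 3.0])      # deterministic rewards of arm 0 and arm 1

def grad_samples(n: int, baseline: float) -> np.ndarray:
    """Per-sample REINFORCE estimates of d/dtheta E[R]."""
    arm1 = rng.random(n) < p
    r = np.where(arm1, rewards[1], rewards[0])
    dlogp = np.where(arm1, 1.0 - p, -p)   # d log pi(a) / d theta
    return (r - baseline) * dlogp

no_base = grad_samples(100_000, baseline=0.0)
with_base = grad_samples(100_000, baseline=rewards.mean())

print(no_base.mean(), with_base.mean())  # both ~0.49: still unbiased
print(no_base.var(), with_base.var())    # ~0.84 vs ~0.006: far less variance
```

With only a few samples, the no-baseline estimate is much noisier, which is the "in practice we can't sample enough" point.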
That's still not clear to me. Independent of the dataset size or batch size used, the advantage function should give me positive advantages for things I should do more often and negative advantages for things I should do less often (that's why we subtract the baseline/value function), or am I wrong? Furthermore, subtracting the mean does change this behaviour, in my view.
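A concrete instance of the concern in this comment, using the `normalize_advantages` sketch above with made-up numbers:

```python
import numpy as np

advs = np.array([0.5, 1.0, 3.0])   # all positive: every action beat the value baseline
norm = (advs - advs.mean()) / (advs.std() + 1e-8)
print(norm)                        # ~[-0.93, -0.46, 1.39]: two actions now get pushed down
```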
@Marcel1991 I was also wondering about this. Did you gain any insight into the benefits of normalising / not normalising in PPO?
I still don't understand what the advantage of normalizing the advantages is.
https://arxiv.org/pdf/2006.05990.pdf concludes that "per-minibatch advantage normalization (C67) seems not to affect the performance too much (Fig. 35)".
Doesn't subtracting the mean from the advantages have the effect of an entropy regularizer? Ignoring the clipping, the objective is (writing $\mu$ and $\sigma$ for the batch mean and std of the advantages, and using $\mathbb{E}[\log \pi(a \mid s)] = -\mathcal{H}[\pi]$)

$$\mathbb{E}\!\left[ \frac{A - \mu}{\sigma} \log \pi(a \mid s) \right] = \frac{1}{\sigma}\, \mathbb{E}\!\left[ A \log \pi(a \mid s) \right] + \frac{\mu}{\sigma}\, \mathcal{H}[\pi].$$

If the batch size is small, so the mean and std fluctuate, then the entropy would be increased a bit more on some batches than on others, but it probably doesn't make a big difference.
@danijar That's actually an interesting perspective. But the mean can also be negative, right? If that's the case, the second term actually makes all actions more likely. So it's a bit unclear a priori what the second term is gonna do. |
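To make this exchange concrete, here is a small numeric check (my own sketch) of the decomposition above for a fixed categorical policy. The extra term is scaled by $\mu/\sigma$, so its sign flips with the sign of the batch mean, which is exactly the ambiguity raised in the last reply:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed categorical policy over 3 actions, with made-up per-action advantages.
probs = np.array([0.2, 0.3, 0.5])
adv = np.array([-1.0, 0.5, 2.0])          # subtract e.g. 3.0 to make the mean negative

a = rng.choice(3, size=200_000, p=probs)  # actions sampled from the policy
A, L = adv[a], np.log(probs)[a]
mu, sigma = A.mean(), A.std()

lhs = ((A - mu) / sigma * L).mean()                        # normalized-advantage objective
rhs = (A * L).mean() / sigma + (mu / sigma) * -L.mean()    # -L.mean() estimates entropy H
print(lhs, rhs)  # identical: mean subtraction adds (mu/sigma) * H to the objective
```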