A2C vs PPO Advantage normalisation #102
Comments
As far as I know, the PPO paper does not mention any normalisation explicitly. However, due to the mini-batch nature of PPO, such a normalisation makes sense. Maybe there is another reason for it, but an n-step forward view over multiple environments can be interpreted as your dataset at a specific timestep, and PPO runs mini-batch training on this dataset. Maybe, for the same reasons as in supervised learning, we apply preprocessing (including normalisation) to that dataset.
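For reference, a minimal sketch of the per-batch advantage normalisation the linked PPO code performs (the function name and epsilon value here are illustrative, not taken from the repo):

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Standardize advantages over the collected rollout (n_steps * n_envs samples)."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# Typically applied once per update, before the rollout is split into mini-batches
# for the PPO epochs (some implementations normalize per mini-batch instead).
```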
@araffin In PPO this normalization can be performed over large batches. But in A2C mini-batches are usually small, so it will significantly increase the noise/variance of the updates. One way to implement it is to use a running normalizer instead of a batch one.
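A sketch of what such a running normalizer could look like (an illustrative implementation, not code from this repo), keeping running mean/variance across updates so that a small A2C batch only nudges the statistics instead of defining them from scratch:

```python
import torch

class RunningAdvantageNormalizer:
    """Normalize advantages with running statistics instead of per-batch ones."""

    def __init__(self, eps: float = 1e-4):
        self.mean = torch.zeros(())
        self.var = torch.ones(())
        self.count = eps  # avoids division by zero before the first update

    def update(self, x: torch.Tensor) -> None:
        # Welford-style combination of the running and batch statistics.
        batch_mean, batch_var, batch_count = x.mean(), x.var(unbiased=False), x.numel()
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        m2 = m_a + m_b + delta.pow(2) * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        self.update(x)
        return (x - self.mean) / (torch.sqrt(self.var) + 1e-8)
```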
Thanks for your answers. Though, in my mind, I'm still not really convinced by the argument for A2C. For me, A2C and PPO2 share the same idea of workers, and even if "mini-batches are usually small" for A2C, nothing prevents them from being large, no? Related to openai/baselines#544
@ikostrikov Thanks for the detailed comments. I am curious why, for small mini-batches in A2C, normalizing the advantage would increase the noise/variance of the updates?
@araffin Yes, one can increase the size of the mini-batches, but it might make A2C less sample efficient. @zuoxingdong Because each gradient in the sum now depends on the statistics of the other elements of the mini-batch. Both options are possible.
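A toy illustration of that coupling (the numbers are made up): with per-batch normalization, the effective weight of a sample depends on which other samples happen to share its batch, and the effect is stronger for small batches.

```python
import torch

adv = torch.tensor([1.0, 0.5, -0.2, 2.0])  # made-up raw advantages
normalize = lambda a: (a - a.mean()) / (a.std() + 1e-8)

small_batch = adv[:2]   # a small A2C-style mini-batch containing sample 0
large_batch = adv       # a larger batch containing the same sample 0

print(normalize(small_batch)[0])  # ~0.71
print(normalize(large_batch)[0])  # ~0.19: same sample, very different weight
```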
Hello,
Looking at your implementation, I was wondering if there was any reason why the advantage is normalized in PPO, whereas it is not in A2C.
ppo: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/algo/ppo.py#L34
a2c: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/algo/a2c_acktr.py#L47
Surprisingly, the same choice was made in OpenAI Baselines:
ppo: https://github.com/hill-a/stable-baselines/blob/master/baselines/ppo2/ppo2.py#L98
a2c: https://github.com/hill-a/stable-baselines/blob/master/baselines/a2c/a2c.py#L65
(Also, in OpenAI Baselines, for ppo2, they additionally clip the value function, even though it is not mentioned in the paper)
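For context, a sketch of that value-function clipping as commonly implemented in PPO2-style code (function name and the 0.5 factor are illustrative): the loss takes the worse of the unclipped and clipped squared errors, mirroring the clipped policy surrogate.

```python
import torch

def clipped_value_loss(values: torch.Tensor,
                       old_values: torch.Tensor,
                       returns: torch.Tensor,
                       clip_range: float = 0.2) -> torch.Tensor:
    """PPO2-style value loss: clip the new value prediction around the old one."""
    values_clipped = old_values + (values - old_values).clamp(-clip_range, clip_range)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    # Pessimistic (max) combination, analogous to the clipped surrogate objective.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```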