
A2C vs PPO Advantage normalisation #102

Closed
araffin opened this issue Jul 30, 2018 · 5 comments


araffin commented Jul 30, 2018

Hello,

Looking at your implementation, I was wondering if there is any reason why the advantage is normalized in PPO, whereas it is not in A2C.
ppo: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/algo/ppo.py#L34
a2c: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/algo/a2c_acktr.py#L47
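For context, the normalisation in question is a per-batch standardisation of the advantage estimates. The repository uses PyTorch, but the operation can be sketched minimally in NumPy (the `eps` guard against division by zero is an assumption matching common practice, not copied from the linked code):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-5):
    # Standardise advantages over the whole batch: zero mean, unit std.
    # eps guards against division by zero for a (near-)constant batch.
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```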

Surprisingly, the same choice was made in OpenAI Baselines:
ppo: https://github.com/hill-a/stable-baselines/blob/master/baselines/ppo2/ppo2.py#L98
a2c: https://github.com/hill-a/stable-baselines/blob/master/baselines/a2c/a2c.py#L65

(Also, in OpenAI Baselines, for ppo2, they additionally clip the value function, even though this is not mentioned in the paper.)

@timmeinhardt
Contributor

As far as I know, the PPO paper does not mention any normalisation explicitly. However, given the mini-batch nature of PPO, such a normalisation makes sense. There may be another reason, but the n-step forward view over multiple environments can be interpreted as a dataset collected at a specific timestep, on which PPO runs mini-batch training. Perhaps, for the same reasons as in supervised learning, we apply preprocessing (including normalisation) to this dataset.
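The rollout-as-dataset view above can be sketched as follows: an n-step rollout over several workers is treated as a small dataset, which PPO reshuffles into mini-batches for a few epochs (function and parameter names are illustrative, not from the repository):

```python
import numpy as np

def ppo_minibatch_indices(rollout_size, minibatch_size, epochs, seed=0):
    # Treat the rollout buffer like a supervised dataset: reshuffle
    # each epoch and yield mini-batches of indices over it.
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(rollout_size)
        for start in range(0, rollout_size, minibatch_size):
            yield order[start:start + minibatch_size]
```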

@ikostrikov
Owner

@araffin In PPO this normalization can be performed over large batches. But in A2C the mini-batches are usually small, so it would significantly increase the noise/variance of the updates.

One way to implement it for A2C is to use a running normalization instead of a per-batch one.
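A running normalizer along these lines could look like the sketch below, using a streaming mean/variance (Welford's algorithm); the class and method names are made up for illustration:

```python
class RunningAdvantageNorm:
    """Streaming mean/variance (Welford's algorithm), so advantages can be
    normalised with statistics accumulated across updates rather than
    per mini-batch."""

    def __init__(self, eps=1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.eps = eps

    def update(self, values):
        # Fold a new batch of advantage values into the running statistics.
        for v in values:
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    def normalize(self, values):
        # Standardise with the running statistics (sample variance).
        var = self.m2 / max(self.count - 1, 1)
        std = var ** 0.5
        return [(v - self.mean) / (std + self.eps) for v in values]
```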

@araffin
Author

araffin commented Aug 30, 2018

Thanks for your answers, although I'm not really convinced by the argument for A2C. To me, A2C and PPO2 share the same idea of workers, and even if "minibatches are usually small" for A2C, nothing prevents them from being large, no?

Related to openai/baselines#544

@zuoxingdong

@ikostrikov Thanks for the detailed comments. I am curious why, for a small mini-batch in A2C, normalizing the advantage would increase the noise/variance of the updates.
Also, for the mini-batch normalization, should the mean and std be computed over all elements of the entire batch, or over the batch dimension, i.e. each data entry gets its own normalized advantages?
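To make the two options in the question concrete, assuming the advantages are stored as a (timesteps, workers) array (the axis layout is an assumption for illustration, not taken from the repository):

```python
import numpy as np

def normalize_flat(adv, eps=1e-5):
    # Option 1: one mean/std computed over every element of the batch.
    return (adv - adv.mean()) / (adv.std() + eps)

def normalize_over_workers(adv, eps=1e-5):
    # Option 2: statistics over the worker axis only, so each timestep
    # (each "data entry") gets its own mean/std.
    mean = adv.mean(axis=1, keepdims=True)
    std = adv.std(axis=1, keepdims=True)
    return (adv - mean) / (std + eps)
```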

@ikostrikov
Owner

@araffin Yes, one can increase the size of the mini-batches, but that might make A2C less sample-efficient.

@zuoxingdong Because each gradient in the sum now depends on the statistics of the other elements of the mini-batch. Both options are possible.
