A2C vs PPO Advantage normalisation #102
Comments
As far as I know, the PPO paper does not mention any normalisation explicitly. However, due to the mini-batch nature of PPO, such a normalisation makes sense. Maybe there is another reason for it, but an n-step forward view over multiple environments can be interpreted as your dataset at a specific timestep, and PPO runs mini-batch training on this dataset. Maybe, for the same reasons as in supervised learning, we apply preprocessing (including normalisation) to that dataset.
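For reference, a minimal sketch of the per-batch advantage normalisation the linked PPO code performs (the function name and epsilon value here are illustrative, not taken from the repo):

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Standardize advantages over the collected rollout (n_steps * n_envs samples)."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# Typically applied once per update, before the rollout is split into mini-batches
# for the PPO epochs (some implementations normalize per mini-batch instead).
```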
@araffin In PPO this normalization can be performed over large batches. But in A2C mini-batches are usually small, so it will significantly increase the noise/variance of the updates. One way to implement it is to use a running normalizer instead of a batch one.
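A sketch of what such a running normalizer could look like (an illustrative implementation, not code from this repo), keeping running mean/variance across updates so that a small A2C batch only nudges the statistics instead of defining them from scratch:

```python
import torch

class RunningAdvantageNormalizer:
    """Normalize advantages with running statistics instead of per-batch ones."""

    def __init__(self, eps: float = 1e-4):
        self.mean = torch.zeros(())
        self.var = torch.ones(())
        self.count = eps  # avoids division by zero before the first update

    def update(self, x: torch.Tensor) -> None:
        # Welford-style combination of the running and batch statistics.
        batch_mean, batch_var, batch_count = x.mean(), x.var(unbiased=False), x.numel()
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        m2 = m_a + m_b + delta.pow(2) * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        self.update(x)
        return (x - self.mean) / (torch.sqrt(self.var) + 1e-8)
```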
Thanks for your answers. Though, in my mind, I'm still not really convinced by the argument for A2C. For me, A2C and PPO2 share the same idea of workers, and even if "mini-batches are usually small" for A2C, nothing prevents them from being large, no? Related to openai/baselines#544
@ikostrikov Thanks for the detailed comments. I am curious why, for small mini-batches in A2C, normalizing the advantage would increase the noise/variance of the updates?
@araffin Yes, one can increase the size of the mini-batches, but it might make A2C less sample efficient. @zuoxingdong Because each gradient in the sum now depends on the statistics of the other elements of the mini-batch. Both options are possible.
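A toy illustration of that coupling (the numbers are made up): with per-batch normalization, the effective weight of a sample depends on which other samples happen to share its batch, and the effect is stronger for small batches.

```python
import torch

adv = torch.tensor([1.0, 0.5, -0.2, 2.0])  # made-up raw advantages
normalize = lambda a: (a - a.mean()) / (a.std() + 1e-8)

small_batch = adv[:2]   # a small A2C-style mini-batch containing sample 0
large_batch = adv       # a larger batch containing the same sample 0

print(normalize(small_batch)[0])  # ~0.71
print(normalize(large_batch)[0])  # ~0.19: same sample, very different weight
```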
Hello,
Looking at your implementation, I was wondering if there was any reason why the advantage is normalized in PPO, whereas it is not in A2C.
ppo: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/algo/ppo.py#L34
a2c: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/algo/a2c_acktr.py#L47
Surprisingly, the same choice was made in OpenAI Baselines:
ppo: https://github.com/hill-a/stable-baselines/blob/master/baselines/ppo2/ppo2.py#L98
a2c: https://github.com/hill-a/stable-baselines/blob/master/baselines/a2c/a2c.py#L65
(Also, in OpenAI Baselines, for ppo2, they additionally clip the value function, even though it is not mentioned in the paper)
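For context, a sketch of that value-function clipping as commonly implemented in PPO2-style code (function name and the 0.5 factor are illustrative): the loss takes the worse of the unclipped and clipped squared errors, mirroring the clipped policy surrogate.

```python
import torch

def clipped_value_loss(values: torch.Tensor,
                       old_values: torch.Tensor,
                       returns: torch.Tensor,
                       clip_range: float = 0.2) -> torch.Tensor:
    """PPO2-style value loss: clip the new value prediction around the old one."""
    values_clipped = old_values + (values - old_values).clamp(-clip_range, clip_range)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    # Pessimistic (max) combination, analogous to the clipped surrogate objective.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```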