
PPO some feedback #445

Open · lhk opened this issue Jun 17, 2018 · 8 comments

@lhk

lhk commented Jun 17, 2018

I'm having some problems with my PPO implementation, so I'm browsing your code.
Thank you so much for open-sourcing this. It's lots of fun to dig through your implementation :)

But there are some points that confuse me. Basically, my feedback is similar to that on A2C (#413): it would be great if your code had more comments; there are some design differences between the implementations, and you make some changes relative to the paper that are not explained.

I'm reading the code of ppo and ppo2. Here are some of the things I noted:

  • Naming schemes are completely different. In ppo your target advantage is called atarg, but in ppo2 you call the same thing ADV. Things like this are everywhere. The naming is not even internally consistent: atarg is also used for a value that is not the placeholder.

  • Lack of code reuse, different coding styles. For example, GAE is calculated with a similar algorithm in both versions. You even have a helper function here. But in ppo2 you don't use the same function, and the algorithm looks slightly different. Another example of the style differences is the policies: MlpPolicy, for instance, looks completely different in the two versions but does mostly the same thing.

  • Huge architectural differences. In ppo you produce training data with a generator. In ppo2 you use an implementation of the Runner class.

  • Nontrivial changes to the paper. I read the OpenAI blog post and the PPO paper. The key innovation is the clipped ratio of action probabilities (see the sketch after this list), and ppo does that. But ppo2 also applies this clipping to the value function. I would consider that a big difference between ppo and ppo2, and it is not explained or motivated at all. Intuitively it makes sense, but ppo2 was described as a faster, GPU-based version of ppo, and the differences seem to be more than that.

  • Nontrivial changes to the paper, part 2. The code is sprinkled with small tricks. For example, you apply z-normalization to the advantage estimates here (also sketched after this list). This one could be my lack of understanding, but I've only seen something like this in dueling Q-learning, where you subtract the mean of the advantage and add the value. That might work well here too, but that's not really z-normalization. It would be great if you could comment on stuff like this. Is it important, maybe even necessary?

  • Differences between the policies. In ppo, the MlpPolicy setup takes this if branch. That seems like an important modification and honestly, I don't understand why it is necessary. In ppo2, there is no such code.
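
To make the points about the clipped ratio and the advantage normalization concrete, here is a minimal NumPy paraphrase (my own sketch, not the baselines code itself) of what I mean:

import numpy as np

# Sketch of the clipped surrogate objective from the PPO paper.
# ratio = pi_new(a|s) / pi_old(a|s), adv = estimated advantage.
def ppo_policy_loss(ratio, adv, clip_range=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
    # negate because we minimize a loss but want to maximize the surrogate
    return -np.mean(np.minimum(unclipped, clipped))

# Sketch of the per-batch advantage standardization ("z-normalization"):
# zero mean and unit standard deviation within the batch.
def standardize_advantages(adv, eps=1e-8):
    return (adv - adv.mean()) / (adv.std() + eps)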

@lhk

lhk commented Jun 18, 2018

This isn't just meant to offload criticism (which, by the way, feels almost ungrateful; huge thanks again for this code :) ).

I would be very interested in some hints regarding the value function clipping, the advantage normalization, and the reasons for the differences between the policies. If you could just point me to the corresponding paper, that would be great.

@xyshadow

Hi, @lhk

There is one more difference between PPO1 and PPO2 that I don't understand. Maybe I'm wrong, but in PPO1 the model actually maintains both the old and the new set of parameters, and pi/oldpi are computed from them. In PPO2, however, the old neglogpac is just the neglogpac computed from the rollout. So in theory, during training, the current and old neglogpac would be the same for the first minibatch of the first epoch, and by the 3rd+ minibatch the old neglogpac would be a few generations older. Is that equivalent to what is described in the PPO paper?
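
To make the bookkeeping I mean concrete, here is a small sketch with made-up numbers (not the baselines API): in ppo2 the rollout-time log-probabilities are stored once and reused for every epoch/minibatch, while the current ones are recomputed at each gradient step.

import numpy as np

# Made-up log-probabilities, just to illustrate the bookkeeping.
old_neglogpac = np.array([1.2, 0.7, 2.1])  # -log pi_rollout(a|s), recorded once at rollout time
new_neglogpac = np.array([1.0, 0.9, 2.0])  # -log pi_current(a|s), recomputed every minibatch

# Probability ratio pi_current(a|s) / pi_rollout(a|s), as used in the clipped objective.
ratio = np.exp(old_neglogpac - new_neglogpac)
print(ratio)  # values > 1 mean the current policy favors that action more than at rollout time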

@albarji

albarji commented Jul 30, 2018

I'm also interested in the value function clipping. An intriguing thing I found is that the code for PPO2 takes the maximum of the clipped and non-clipped value losses, as follows:

vpred = train_model.vf  # current value prediction
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)  # prediction kept within CLIPRANGE of the rollout-time value
vf_losses1 = tf.square(vpred - R)  # unclipped squared error against the return R
vf_losses2 = tf.square(vpredclipped - R)  # squared error of the clipped prediction
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))  # element-wise max, then mean

But since you are mixing a squared difference and the square of that same difference clipped to a limit value, doesn't that mean the non-clipped loss will always dominate the maximum, thus rendering the clipping ineffective?

@xyshadow

Well, I think that depends on the value of R, right? Say we have OLDVPRED=0.8, vpred=1, CLIPRANGE=0.1, so vpredclipped would be 0.9. If R is closer to 1, then vf_losses2 would be larger; if R is closer to 0.9, vf_losses1 would be larger.
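
A quick NumPy check of that example (just plugging the numbers above into the quoted formulas):

import numpy as np

OLDVPRED, vpred, CLIPRANGE = 0.8, 1.0, 0.1
vpredclipped = OLDVPRED + np.clip(vpred - OLDVPRED, -CLIPRANGE, CLIPRANGE)  # -> 0.9

for R in (1.0, 0.9):
    vf_losses1 = (vpred - R) ** 2          # unclipped
    vf_losses2 = (vpredclipped - R) ** 2   # clipped
    print(R, vf_losses1, vf_losses2)
# R = 1.0: 0.00 vs 0.01 -> the clipped term dominates the maximum
# R = 0.9: 0.01 vs 0.00 -> the unclipped term dominates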

@albarji

albarji commented Jul 31, 2018

Oh, you are right! I thought this clipping always made the difference (vpred - R) smaller, but looking at the code again it seems that R = advantage + OLDVPRED. In the clipped loss you are using a clipped vpred that cannot deviate much from OLDVPRED, but you still have the influence of the advantage in R.

So I guess that even if the network is updated so that the value prediction is perfect (vpred == OLDVPRED + advantage), you will still suffer some loss if the change in predicted value is large. I imagine this has an effect similar to the clipping presented in the PPO paper for the PG loss.
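
A small numeric sketch of that case (made-up numbers, assuming R = OLDVPRED + advantage as above): the new prediction matches the return exactly, yet the clipped term still contributes a nonzero loss because the prediction moved further from OLDVPRED than CLIPRANGE allows.

import numpy as np

OLDVPRED, advantage, CLIPRANGE = 0.0, 1.0, 0.2
R = OLDVPRED + advantage      # 1.0
vpred = R                     # "perfect" new prediction
vpredclipped = OLDVPRED + np.clip(vpred - OLDVPRED, -CLIPRANGE, CLIPRANGE)  # 0.2

vf_losses1 = (vpred - R) ** 2          # 0.0
vf_losses2 = (vpredclipped - R) ** 2   # 0.64
vf_loss = 0.5 * max(vf_losses1, vf_losses2)  # 0.32, so the large value update still shows up in the loss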

@brendenpetersen

I've just switched from PPO1 to PPO2 after they cleaned up their codebase and added MPI support for PPO2. Unfortunately, PPO2 isn't reaching PPO1's performance on my environments (with the same hyperparameters where possible).

@lhk Thank you for pointing out some of these differences. I wonder if any of these are responsible for the performance hit I'm experiencing.

The value function clipping seems particularly strange, as it's relative to the scaling of the reward function (assuming returns aren't normalized). On the other hand, the likelihood ratio is self-normalized, so a unitless clip value like 0.2 actually means something (20% change). I'll try removing this value clipping.
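
For reference, the variant I plan to try is just the plain (unclipped) squared-error value loss, something like the following, reusing the names from the snippet @albarji quoted above and keeping everything else as in ppo2:

# Plain squared-error value loss, no clipping; 0.5 factor and the mean kept as before.
vf_loss = .5 * tf.reduce_mean(tf.square(train_model.vf - R))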

@brendenpetersen

I found a timely publication: https://arxiv.org/pdf/1811.02553.pdf.

This paper raises many questions about TRPO/PPO, both with respect to each algorithm and to its implementation. In one set of experiments it actually references the baselines-implementation-specific value clipping brought up here by @lhk and performs an ablation test on it. It also performs ablations on reward scaling, learning rate annealing (huge difference here!), and weight initialization.

@takuma-yoneda

Could anyone also explain why the value loss is multiplied by 0.5?

vf_losses1 = tf.square(vpred - R)
vf_losses2 = tf.square(vpredclipped - R)
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
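
My guess is that the 0.5 just cancels the factor of 2 that comes out of differentiating the square, so the gradient is proportional to (vpred - R) instead of 2 * (vpred - R); it only rescales the loss and could equally be absorbed into the value-loss coefficient. A tiny finite-difference check of that reading (but I'd appreciate confirmation):

import numpy as np

# d/dv [0.5 * (v - R)^2] = (v - R), whereas d/dv [(v - R)^2] = 2 * (v - R).
v, R, eps = 1.5, 1.0, 1e-6
grad_half = (0.5 * (v + eps - R) ** 2 - 0.5 * (v - eps - R) ** 2) / (2 * eps)
grad_full = ((v + eps - R) ** 2 - (v - eps - R) ** 2) / (2 * eps)
print(grad_half, grad_full)  # ~0.5 and ~1.0, i.e. (v - R) and 2 * (v - R)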
