
PPO some feedback #445

Open · lhk opened this issue Jun 17, 2018 · 8 comments

@lhk

lhk commented Jun 17, 2018

I'm having some problems with my PPO implementation, so I'm browsing your code.
Thank you so much for open-sourcing this. It's lots of fun to dig through your implementation :)

But there are some points that confuse me. Basically, my feedback is similar to that on A2C (#413): it would be great if your code had more comments; there are some design differences between the implementations, and you make some changes relative to the paper that are not explained.

I'm reading the code of ppo and ppo2. Here are some of the things I noted:

  • Naming schemes are completely different. In ppo your target advantage is called atarg, but in ppo2 you call the same thing ADV. Things like this are everywhere. The naming is not even internally consistent: atarg is also used for a value that is not the placeholder.

  • Lack of code reuse, different coding styles. For example, GAE is calculated with a similar algorithm in both versions. You even have a helper function here. But in ppo2 you don't use the same function, and the algorithm looks slightly different. Another example of the style differences is the policies: MlpPolicy, for instance, looks completely different in the two versions but does mostly the same thing.

  • Huge architectural differences. In ppo you produce training data with a generator. In ppo2 you use an implementation of the Runner class.

  • Nontrivial changes to the paper. I read the OpenAI blog post and the PPO paper. The key innovation is the clipped ratio of action probabilities (see the sketch after this list), and ppo does that. But ppo2 also applies this clipping to the value function. I would consider that a big difference between ppo and ppo2, and it is not explained or motivated at all. Intuitively it makes sense, but ppo2 was described as a faster, GPU-based version of ppo, and the differences seem to be more than that.

  • Nontrivial changes to the paper, part 2. The code is sprinkled with small tricks. For example, you apply z-normalization to the advantage estimates here (also sketched after this list). This one could be my lack of understanding, but I've only seen something like this in dueling Q-learning, where you subtract the mean of the advantage and add the value. That might work well here too, but that's not really z-normalization. It would be great if you could comment on stuff like this. Is it important, maybe even necessary?

  • Differences between the policies. In ppo, the MlpPolicy setup takes this if branch. That seems like an important modification and honestly, I don't understand why it is necessary. In ppo2, there is no such code.
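
To make the points about the clipped ratio and the advantage normalization concrete, here is a minimal NumPy paraphrase (my own sketch, not the baselines code itself) of what I mean:

import numpy as np

# Sketch of the clipped surrogate objective from the PPO paper.
# ratio = pi_new(a|s) / pi_old(a|s), adv = estimated advantage.
def ppo_policy_loss(ratio, adv, clip_range=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
    # negate because we minimize a loss but want to maximize the surrogate
    return -np.mean(np.minimum(unclipped, clipped))

# Sketch of the per-batch advantage standardization ("z-normalization"):
# zero mean and unit standard deviation within the batch.
def standardize_advantages(adv, eps=1e-8):
    return (adv - adv.mean()) / (adv.std() + eps)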

@lhk

lhk commented Jun 18, 2018

This isn't just meant to offload criticism (which, by the way, feels almost ungrateful; huge thanks again for this code :) ).

I would be very interested in some hints regarding the value function clipping, the advantage normalization, and the reasons for the differences between the policies. If you could just point me to the corresponding paper, that would be great.

@xyshadow

Hi, @lhk

There is one more difference between PPO1 and PPO2 that I don't understand. Maybe I'm wrong, but in PPO1 the model actually maintains both the old and the new set of parameters, and pi/oldpi are computed from them. In PPO2, however, the old neglogpac is just the neglogpac computed from the rollout. So in theory, during training, the current and old neglogpac would be the same for the first minibatch of the first epoch, and by the 3rd+ minibatch the old neglogpac would be a few generations older. Is that equivalent to what is described in the PPO paper?
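
To make the bookkeeping I mean concrete, here is a small sketch with made-up numbers (not the baselines API): in ppo2 the rollout-time log-probabilities are stored once and reused for every epoch/minibatch, while the current ones are recomputed at each gradient step.

import numpy as np

# Made-up log-probabilities, just to illustrate the bookkeeping.
old_neglogpac = np.array([1.2, 0.7, 2.1])  # -log pi_rollout(a|s), recorded once at rollout time
new_neglogpac = np.array([1.0, 0.9, 2.0])  # -log pi_current(a|s), recomputed every minibatch

# Probability ratio pi_current(a|s) / pi_rollout(a|s), as used in the clipped objective.
ratio = np.exp(old_neglogpac - new_neglogpac)
print(ratio)  # values > 1 mean the current policy favors that action more than at rollout time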

@albarji

albarji commented Jul 30, 2018

I'm also interested in the value function clipping. An intriguing thing I found is that the code for PPO2 takes the maximum of the clipped and non-clipped value losses, as follows:

vpred = train_model.vf  # current value prediction
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)  # prediction kept within CLIPRANGE of the rollout-time value
vf_losses1 = tf.square(vpred - R)  # unclipped squared error against the return R
vf_losses2 = tf.square(vpredclipped - R)  # squared error of the clipped prediction
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))  # element-wise max, then mean

But since you are mixing a squared difference and the square of that same difference clipped to a limit value, doesn't that mean the non-clipped loss will always dominate the maximum, thus rendering the clipping ineffective?

@xyshadow

Well, I think that depends on the value of R, right? Say we have OLDVPRED=0.8, vpred=1, CLIPRANGE=0.1, so vpredclipped would be 0.9. If R is closer to 1, then vf_losses2 would be larger; if R is closer to 0.9, vf_losses1 would be larger.
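
A quick NumPy check of that example (just plugging the numbers above into the quoted formulas):

import numpy as np

OLDVPRED, vpred, CLIPRANGE = 0.8, 1.0, 0.1
vpredclipped = OLDVPRED + np.clip(vpred - OLDVPRED, -CLIPRANGE, CLIPRANGE)  # -> 0.9

for R in (1.0, 0.9):
    vf_losses1 = (vpred - R) ** 2          # unclipped
    vf_losses2 = (vpredclipped - R) ** 2   # clipped
    print(R, vf_losses1, vf_losses2)
# R = 1.0: 0.00 vs 0.01 -> the clipped term dominates the maximum
# R = 0.9: 0.01 vs 0.00 -> the unclipped term dominates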

@albarji

albarji commented Jul 31, 2018

Oh, you are right! I thought this clipping always made the difference (vpred - R) smaller, but looking at the code again it seems that R = advantage + OLDVPRED. In the clipped loss you are using a clipped vpred that cannot deviate much from OLDVPRED, but you still have the influence of the advantage in R.

So I guess that even if the network is updated so that the value prediction is perfect (vpred == OLDVPRED + advantage), you will still suffer some loss if the change in predicted value is large. I imagine this has an effect similar to the clipping presented in the PPO paper for the PG loss.
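
A small numeric sketch of that case (made-up numbers, assuming R = OLDVPRED + advantage as above): the new prediction matches the return exactly, yet the clipped term still contributes a nonzero loss because the prediction moved further from OLDVPRED than CLIPRANGE allows.

import numpy as np

OLDVPRED, advantage, CLIPRANGE = 0.0, 1.0, 0.2
R = OLDVPRED + advantage      # 1.0
vpred = R                     # "perfect" new prediction
vpredclipped = OLDVPRED + np.clip(vpred - OLDVPRED, -CLIPRANGE, CLIPRANGE)  # 0.2

vf_losses1 = (vpred - R) ** 2          # 0.0
vf_losses2 = (vpredclipped - R) ** 2   # 0.64
vf_loss = 0.5 * max(vf_losses1, vf_losses2)  # 0.32, so the large value update still shows up in the loss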

@brendenpetersen

I've just switched from PPO1 to PPO2 after they cleaned up their codebase and added MPI support for PPO2. Unfortunately, PPO2 isn't reaching PPO1's performance on my environments (with the same hyperparameters where possible).

@lhk Thank you for pointing out some of these differences. I wonder if any of these are responsible for the performance hit I'm experiencing.

The value function clipping seems particularly strange, as it's relative to the scaling of the reward function (assuming returns aren't normalized). On the other hand, the likelihood ratio is self-normalized, so a unitless clip value like 0.2 actually means something (20% change). I'll try removing this value clipping.
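
For reference, the variant I plan to try is just the plain (unclipped) squared-error value loss, something like the following, reusing the names from the snippet @albarji quoted above and keeping everything else as in ppo2:

# Plain squared-error value loss, no clipping; 0.5 factor and the mean kept as before.
vf_loss = .5 * tf.reduce_mean(tf.square(train_model.vf - R))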

@brendenpetersen

I found a timely publication: https://arxiv.org/pdf/1811.02553.pdf.

This paper raises many questions about TRPO/PPO, both with respect to each algorithm and to its implementation. In one set of experiments it actually references the baselines-implementation-specific value clipping brought up here by @lhk and performs an ablation test on it. It also performs ablations on reward scaling, learning rate annealing (huge difference here!), and weight initialization.

@takuma-yoneda

Could anyone also explain why the value loss is multiplied by 0.5?

vf_losses1 = tf.square(vpred - R)
vf_losses2 = tf.square(vpredclipped - R)
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
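
My guess is that the 0.5 just cancels the factor of 2 that comes out of differentiating the square, so the gradient is proportional to (vpred - R) instead of 2 * (vpred - R); it only rescales the loss and could equally be absorbed into the value-loss coefficient. A tiny finite-difference check of that reading (but I'd appreciate confirmation):

import numpy as np

# d/dv [0.5 * (v - R)^2] = (v - R), whereas d/dv [(v - R)^2] = 2 * (v - R).
v, R, eps = 1.5, 1.0, 1e-6
grad_half = (0.5 * (v + eps - R) ** 2 - 0.5 * (v - eps - R) ** 2) / (2 * eps)
grad_full = ((v + eps - R) ** 2 - (v - eps - R) ** 2) / (2 * eps)
print(grad_half, grad_full)  # ~0.5 and ~1.0, i.e. (v - R) and 2 * (v - R)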
