
PPO2: Why combine loss functions when parameters are not shared between policy and value? #766

Closed
rallen10 opened this issue Dec 18, 2018 · 3 comments

Comments

@rallen10

rallen10 commented Dec 18, 2018

In ppo2, the loss function used for training is the combined loss = policy_loss + value_loss - entropy (see: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py#L88). As described in Eqn 9 of the original PPO paper, this is the loss function to use when your model shares parameters between the policy and the value function. However, if you set value_network="copy", my understanding is that the parameters are no longer shared between the policy and value networks; they have the same structure, but they are two separate networks. Under this condition, shouldn't the training operation be broken up into a policy training operation (i.e. actor training) and a value training operation (i.e. critic training)? As far as I can tell, that is not implemented anywhere in the ppo2 baselines code.

This question arose while trying to debug a learning process where my value loss was orders of magnitude larger than my policy loss, so it completely dominated the combined loss function and therefore the learning behaviour. This also raised the question of whether the value loss is even calculated correctly (see related question #765), but even if it is, it seems that you may want to split the training of the two networks.
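To make the question concrete, here is roughly the kind of split I have in mind, as a toy sketch (TF1-style ops, placeholder losses, and hypothetical variable scopes; not baselines code): one training op per network, each restricted to that network's variables.

```python
import tensorflow as tf  # TF 1.x style API, as used by baselines

# Two separate "networks" (single weight matrices here, purely as stand-ins).
with tf.variable_scope("pi"):
    w_pi = tf.get_variable("w", initializer=tf.ones([4, 1]))
with tf.variable_scope("vf"):
    w_vf = tf.get_variable("w", initializer=tf.ones([4, 1]))

obs = tf.placeholder(tf.float32, [None, 4])
policy_loss = tf.reduce_mean(tf.matmul(obs, w_pi))            # toy stand-in for the clipped surrogate
value_loss = tf.reduce_mean(tf.square(tf.matmul(obs, w_vf)))  # toy stand-in for the value loss

pi_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="pi")
vf_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="vf")

# One training op per network, each touching only its own parameters.
train_pi = tf.train.AdamOptimizer(3e-4).minimize(policy_loss, var_list=pi_vars)
train_vf = tf.train.AdamOptimizer(3e-4).minimize(value_loss, var_list=vf_vars)
```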

@opherlieber

It should be equivalent. If network #1 doesn't affect value_loss, then value_loss has no effect on the gradients of network #1; it's like adding a constant to the loss. So there is no need for two implementations of the loss.

The following two flows should give the same result:
1. loss = policy_loss + value_loss -> back-propagation
2. loss = policy_loss -> back-propagation, then loss = value_loss -> back-propagation
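A quick way to check this is a toy sketch along the following lines (two separate weight matrices standing in for the two networks, TF1-style ops; not the actual baselines models): the gradient of the combined loss with respect to the policy parameters is identical to the gradient of the policy loss alone.

```python
import numpy as np
import tensorflow as tf  # TF 1.x style API

x = tf.constant(np.random.randn(16, 4).astype(np.float32))
w_pi = tf.Variable(tf.random_normal([4, 1]))  # stand-in for the policy network's parameters
w_vf = tf.Variable(tf.random_normal([4, 1]))  # stand-in for the separate value network's parameters

policy_loss = tf.reduce_mean(tf.square(tf.matmul(x, w_pi)))
value_loss = tf.reduce_mean(tf.square(tf.matmul(x, w_vf) - 1.0))

# Gradient of the combined loss w.r.t. the policy parameters...
g_combined = tf.gradients(policy_loss + value_loss, w_pi)[0]
# ...versus the gradient of the policy loss alone.
g_policy_only = tf.gradients(policy_loss, w_pi)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    a, b = sess.run([g_combined, g_policy_only])
    print(np.allclose(a, b))  # True: value_loss contributes nothing to the policy gradient
```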

@rallen10

Thanks for the feedback. I'm still bringing myself up to speed, so I want to check my understanding here:

I can buy your claim that the policy network does not affect the value_loss, and therefore that the gradient of the value_loss with respect to the policy network parameters is zero. It's less clear to me that the converse holds, though. That is, do the value network parameters affect the policy_loss?

My thought process is as follows: the value network affects the advantage function (e.g. Eqn 10-12 of the original paper), and the advantage function affects the policy_loss (Eqn 7); therefore the value network affects the policy_loss. In turn, this would mean that the gradient of the policy_loss with respect to the value network parameters is non-zero, and your argument would not hold.

Counter-point: I think the flaw in that thought process is that the advantage function only depends on the "old" value network, i.e. the value network parameters that were in place during the rollout that generated the training data, which are static/constant with respect to the gradient descent step. If that's the case, your argument holds up.

Sorry for the stream of consciousness, but I figured I'd rather lay it all out so someone can correct me where I'm wrong.

@opherlieber

Yes, right: the advantages are fed into the loss calculation as constants, so they are no longer connected back to the value network in the computation graph, even though they were originally derived from it. They do affect the policy-loss gradients, but not the value network's gradients.
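A minimal sketch of that point (toy stand-ins, TF1-style ops; not the actual ppo2 graph): because the advantages enter through a placeholder, the graph has no path from the policy loss to the value network's variables, or vice versa.

```python
import tensorflow as tf  # TF 1.x style API

ADV = tf.placeholder(tf.float32, [None])  # advantages, computed in numpy from the rollout
obs = tf.placeholder(tf.float32, [None, 4])

w_pi = tf.Variable(tf.ones([4, 1]))  # stand-in for the policy network
w_vf = tf.Variable(tf.ones([4, 1]))  # stand-in for the separate value network

logits = tf.squeeze(tf.matmul(obs, w_pi), axis=1)
pg_loss = -tf.reduce_mean(logits * ADV)     # toy surrogate loss that uses the advantages
vpred = tf.squeeze(tf.matmul(obs, w_vf), axis=1)
vf_loss = tf.reduce_mean(tf.square(vpred))  # toy value loss

# ADV is a placeholder, so the "old" value estimates used to build it never enter the graph:
print(tf.gradients(pg_loss, [w_vf]))  # [None] - no path from the policy loss to the value net
print(tf.gradients(vf_loss, [w_pi]))  # [None] - and none from the value loss to the policy net
```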
