PPO2 Why combine loss function when parameters not shared between policy and value? #766
Comments
It should be equivalent. If network #1 doesn't affect value_loss, then value_loss has no effect on the gradients of network #1; it's like adding a constant to the loss. So there is no need to have two implementations of the loss. The following two flows should give the same result: [diagram omitted]
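A minimal sketch of that equivalence (written in PyTorch for brevity; the actual baselines code builds a TensorFlow graph, and all names below are illustrative, not from the library): with disjoint parameter sets, the gradient of the combined loss with respect to the policy parameters is identical to the gradient of the policy loss alone.

```python
import torch
import torch.nn as nn

# Toy batch; advantages/returns are plain tensors with no gradient history.
obs = torch.randn(32, 8)
returns = torch.randn(32, 1)
advantages = torch.randn(32, 1)

# Two completely separate networks (the value_network="copy" situation).
policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))

# Simplified stand-ins for the PPO surrogate loss and the value loss.
logp = torch.log_softmax(policy_net(obs), dim=-1)[:, :1]
policy_loss = -(advantages * logp).mean()
value_loss = ((value_net(obs) - returns) ** 2).mean()
combined_loss = policy_loss + value_loss

policy_params = list(policy_net.parameters())

# Gradient of the combined loss w.r.t. the policy parameters...
g_combined = torch.autograd.grad(combined_loss, policy_params, retain_graph=True)
# ...equals the gradient of the policy loss alone, because value_loss does not
# depend on the policy parameters (it behaves like an added constant).
g_policy = torch.autograd.grad(policy_loss, policy_params)
print(all(torch.allclose(a, b) for a, b in zip(g_combined, g_policy)))  # True
```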
Thanks for the feedback. I'm still bringing myself up to speed, so I want to check my understanding here. I can buy the claim that the policy network does not affect the value_loss, and therefore that the gradient of the value_loss with respect to the policy network parameters is zero. It's less clear that the converse holds, though: do the value network parameters affect the policy_loss?

My thought process is as follows: the value network affects the advantage function (e.g. Eqn 10-12 of the original paper), and the advantage function affects the policy_loss (Eqn 7), so the value network affects the policy_loss. That would mean the gradient of the policy_loss with respect to the value network parameters is non-zero, and your argument would not hold.

Counter-point: I think the flaw in that reasoning is that the advantage function only depends on the "old" value network, i.e. the value network parameters that were in place during the rollout that generated the training data, which are static/constant with respect to the gradient descent step. In that case your argument holds up. Sorry for the stream of consciousness, but I figured I'd lay it all out so someone can correct me where I'm wrong.
Yes, right: the advantages are fed into the loss calculation as constants, so they are no longer back-connected to the value network in the computation graph, even though they were originally derived from it. They affect the policy-loss gradients, but not the value network.
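That point can be checked directly in the same toy setup (PyTorch, illustrative names only, not the baselines graph): the advantages are computed from the old value predictions outside the autograd graph, so the policy loss sends no gradient back into the value network.

```python
import torch
import torch.nn as nn

obs = torch.randn(32, 8)
returns = torch.randn(32, 1)
policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))

# Advantages are computed at rollout time from the "old" value predictions,
# outside the graph, so they carry no autograd history (a stand-in for GAE).
with torch.no_grad():
    advantages = returns - value_net(obs)

logp = torch.log_softmax(policy_net(obs), dim=-1)[:, :1]
policy_loss = -(advantages * logp).mean()

# The advantages scale the policy gradient numerically, but no gradient can
# flow through them, so the value network is untouched by the policy loss.
grads = torch.autograd.grad(policy_loss, list(value_net.parameters()),
                            allow_unused=True)
print(grads)  # (None, None, None, None): the value network gets no gradient
```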
In ppo2, the loss function used for training is the combined
loss = policy_loss + value_loss - entropy
(see: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py#L88). As described in Eqn 9 of the original PPO paper, this is the loss function to use if your model shares parameters between the policy and the value function. However, if you set value_network="copy", my understanding is that the parameters are no longer shared between the policy and value networks; they merely have the same structure but are two separate networks. Under this condition, shouldn't the training operation be broken up into a policy training operation (i.e. actor training) and a value training operation (i.e. critic training)? As far as I can tell, that is not implemented anywhere in the ppo2 baselines code.

This question arose from trying to debug a learning process where my value loss was orders of magnitude larger than my policy loss, so it totally dominated the combined loss function and thus the learning behaviour. That further raised the question of whether the value loss is even calculated correctly (see related question #765), but even if it is, it seems that you may want to split the training of the two networks.
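For what it's worth, the linked line weights the terms (roughly loss = policy_loss - ent_coef * entropy + vf_coef * value_loss), so a value loss that dwarfs the policy loss can be rebalanced through vf_coef, or by normalizing returns, without splitting the training. A hedged sketch of the two alternatives follows, in PyTorch with illustrative names rather than the actual baselines TensorFlow code; when no parameters are shared, both deliver the same gradients to each network.

```python
import torch

def combined_step(policy_loss, value_loss, entropy, optimizer,
                  ent_coef=0.01, vf_coef=0.5):
    """ppo2-style single objective over all parameters (policy + value)."""
    loss = policy_loss - ent_coef * entropy + vf_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def split_step(policy_loss, value_loss, entropy,
               policy_optimizer, value_optimizer,
               ent_coef=0.01, vf_coef=0.5):
    """Separate actor and critic updates, as proposed in the question."""
    policy_optimizer.zero_grad()
    # retain_graph in case the two losses share parts of the forward graph
    (policy_loss - ent_coef * entropy).backward(retain_graph=True)
    policy_optimizer.step()

    value_optimizer.zero_grad()
    (vf_coef * value_loss).backward()
    value_optimizer.step()
```

With disjoint parameter sets, the split version mainly changes bookkeeping (separate optimizer state, and optionally separate learning rates for actor and critic) rather than the gradients themselves.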