PPO2 Why combine loss function when parameters not shared between policy and value? #766
Comments
It should be equivalent. If network #1 doesn't affect value_loss, then value_loss has no effect on the gradients of network #1; it's like adding a constant to the loss. So there is no need to have two implementations of the loss. The following two flows should give the same result: [diagram omitted]
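A minimal sketch of that equivalence (written in PyTorch for brevity; the actual baselines code builds a TensorFlow graph, and all names below are illustrative, not from the library): with disjoint parameter sets, the gradient of the combined loss with respect to the policy parameters is identical to the gradient of the policy loss alone.

```python
import torch
import torch.nn as nn

# Toy batch; advantages/returns are plain tensors with no gradient history.
obs = torch.randn(32, 8)
returns = torch.randn(32, 1)
advantages = torch.randn(32, 1)

# Two completely separate networks (the value_network="copy" situation).
policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))

# Simplified stand-ins for the PPO surrogate loss and the value loss.
logp = torch.log_softmax(policy_net(obs), dim=-1)[:, :1]
policy_loss = -(advantages * logp).mean()
value_loss = ((value_net(obs) - returns) ** 2).mean()
combined_loss = policy_loss + value_loss

policy_params = list(policy_net.parameters())

# Gradient of the combined loss w.r.t. the policy parameters...
g_combined = torch.autograd.grad(combined_loss, policy_params, retain_graph=True)
# ...equals the gradient of the policy loss alone, because value_loss does not
# depend on the policy parameters (it behaves like an added constant).
g_policy = torch.autograd.grad(policy_loss, policy_params)
print(all(torch.allclose(a, b) for a, b in zip(g_combined, g_policy)))  # True
```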
Thanks for the feedback. I'm still bringing myself up to speed, so I want to check my understanding here. I can buy the claim that the policy network does not affect the value_loss, and therefore that the gradient of the value_loss with respect to the policy network parameters is zero. It's less clear that the converse holds, though: do the value network parameters affect the policy_loss?

My thought process is as follows: the value network affects the advantage function (e.g. Eqn 10-12 of the original paper), and the advantage function affects the policy_loss (Eqn 7), so the value network affects the policy_loss. That would mean the gradient of the policy_loss with respect to the value network parameters is non-zero, and your argument would not hold.

Counter-point: I think the flaw in that reasoning is that the advantage function only depends on the "old" value network, i.e. the value network parameters that were in place during the rollout that generated the training data, which are static/constant with respect to the gradient descent step. In that case your argument holds up. Sorry for the stream of consciousness, but I figured I'd lay it all out so someone can correct me where I'm wrong.
Yes, right: the advantages are fed into the loss calculation as constants, so they are no longer back-connected to the value network in the computation graph, even though they were originally derived from it. They affect the policy-loss gradients, but not the value network.
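That point can be checked directly in the same toy setup (PyTorch, illustrative names only, not the baselines graph): the advantages are computed from the old value predictions outside the autograd graph, so the policy loss sends no gradient back into the value network.

```python
import torch
import torch.nn as nn

obs = torch.randn(32, 8)
returns = torch.randn(32, 1)
policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))

# Advantages are computed at rollout time from the "old" value predictions,
# outside the graph, so they carry no autograd history (a stand-in for GAE).
with torch.no_grad():
    advantages = returns - value_net(obs)

logp = torch.log_softmax(policy_net(obs), dim=-1)[:, :1]
policy_loss = -(advantages * logp).mean()

# The advantages scale the policy gradient numerically, but no gradient can
# flow through them, so the value network is untouched by the policy loss.
grads = torch.autograd.grad(policy_loss, list(value_net.parameters()),
                            allow_unused=True)
print(grads)  # (None, None, None, None): the value network gets no gradient
```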
In ppo2, the loss function used for training is the combined
loss = policy_loss + value_loss - entropy
(see: https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py#L88). As described in Eqn 9 of the original PPO paper, this is the loss function to use if your model shares parameters between the policy and the value function. However, if you set value_network="copy", my understanding is that the parameters are no longer shared between the policy and value networks; they merely have the same structure but are two separate networks. Under this condition, shouldn't the training operation be broken up into a policy training operation (i.e. actor training) and a value training operation (i.e. critic training)? As far as I can tell, that is not implemented anywhere in the ppo2 baselines code.

This question arose from trying to debug a learning process where my value loss was orders of magnitude larger than my policy loss, so it totally dominated the combined loss function and thus the learning behaviour. That further raised the question of whether the value loss is even calculated correctly (see related question #765), but even if it is, it seems that you may want to split the training of the two networks.
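For what it's worth, the linked line weights the terms (roughly loss = policy_loss - ent_coef * entropy + vf_coef * value_loss), so a value loss that dwarfs the policy loss can be rebalanced through vf_coef, or by normalizing returns, without splitting the training. A hedged sketch of the two alternatives follows, in PyTorch with illustrative names rather than the actual baselines TensorFlow code; when no parameters are shared, both deliver the same gradients to each network.

```python
import torch

def combined_step(policy_loss, value_loss, entropy, optimizer,
                  ent_coef=0.01, vf_coef=0.5):
    """ppo2-style single objective over all parameters (policy + value)."""
    loss = policy_loss - ent_coef * entropy + vf_coef * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def split_step(policy_loss, value_loss, entropy,
               policy_optimizer, value_optimizer,
               ent_coef=0.01, vf_coef=0.5):
    """Separate actor and critic updates, as proposed in the question."""
    policy_optimizer.zero_grad()
    # retain_graph in case the two losses share parts of the forward graph
    (policy_loss - ent_coef * entropy).backward(retain_graph=True)
    policy_optimizer.step()

    value_optimizer.zero_grad()
    (vf_coef * value_loss).backward()
    value_optimizer.step()
```

With disjoint parameter sets, the split version mainly changes bookkeeping (separate optimizer state, and optionally separate learning rates for actor and critic) rather than the gradients themselves.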