Questions about the tensor shapes #1

Closed
typoverflow opened this issue Jan 12, 2023 · 3 comments

@typoverflow commented Jan 12, 2023

Hi,
thanks for sharing the code! I have been trying to port the code to PyTorch, but after some reading, a few questions came up:

  1. The advantage is normalized with the batch mean and std. I'm wondering whether subtracting the batch mean here will cause trouble for the estimation of over-estimation, since it may flip the sign of some advantages within the batch.

    advantages = tf.stop_gradient((advantages - tf.reduce_mean(advantages)) / tf.math.reduce_std(advantages))

  2. The shape of the loss is somewhat weird to me. If I understand correctly, the supervised loss supervised_loss = self._model.train_loss is summed over dim 0 (the ensemble dimension) and averaged over dims 1 and 2, but the adversarial loss is summed over the batch dimension, since at Line 967 both the advantages and log_prob have shape [256,]. This is strange to me because we normally don't sum over the batch dimension (see the shape sketch below).
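To illustrate the shapes I'm referring to, here is a standalone sketch (the ensemble/output sizes are made up, and the names are mine, not the repo's actual code):

    import tensorflow as tf

    per_ensemble_loss = tf.ones([7, 256, 20])  # [ensembles, batch, output dim] (sizes made up)
    advantages = tf.ones([256])                # [batch]
    log_prob = tf.ones([256])                  # [batch]

    # Supervised loss: summed over dim 0 (ensembles), averaged over dims 1 and 2 -> scalar.
    supervised_loss = tf.reduce_sum(tf.reduce_mean(per_ensemble_loss, axis=[1, 2]))

    # Adversarial loss: elementwise product of two [256] vectors -> shape [256,],
    # i.e. still one loss per sample rather than a scalar.
    adversarial_loss = advantages * log_prob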

Please correct me if there is any misunderstanding.

@marc-rigter (Owner)

Hey,

Thanks for your interest in this project; it would be fantastic to have a PyTorch implementation of this!

  1. Subtracting the mean is common practice in advantage normalisation. You can find this, for example, in OpenAI Baselines (Understanding normalization of advantage function openai/baselines#544). I believe the justification is that it is essentially the same as subtracting a baseline from the policy gradient loss. So long as the baseline does not depend on the action, this does not bias the gradient in expectation (see, for example, the "REINFORCE with Baseline" section in the Sutton and Barto RL textbook); a sketch of the argument follows below.

I'm not sure whether this subtraction makes any difference in practice: I don't think I tested without it, as I just implemented the normalisation the same way as OpenAI Baselines.
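To sketch the usual argument (written for a discrete action space; the continuous case replaces the sum with an integral): for any baseline b(s) that does not depend on the action,

$$\mathbb{E}_{a \sim \pi_\theta}\big[b(s)\,\nabla_\theta \log \pi_\theta(a \mid s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s) = b(s)\,\nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\,\nabla_\theta 1 = 0,$$

so the baseline term contributes zero to the expected gradient. (Dividing by the batch std, with stop_gradient applied, just rescales the gradient for that batch and likewise does not change its direction.)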

  2. Thanks for pointing this out. You're right, I implemented this in a strange way, which I did not realise before you pointed it out. From what I can tell, the "total_loss" that is fed to the optimizer has shape [256,], corresponding to the loss for each sample in the batch, rather than the mean across the batch dimension. I should have used reduce_mean to make this a scalar.

It's unclear to me from the documentation how TensorFlow handles this vector of losses: whether it is reduced to a scalar by taking a sum or a mean. Do you know if it is definitely a sum? If it is, then the implication is that the scale of the gradients will depend on the batch size, so changing the batch size might require a corresponding change to the learning rate. This is certainly not ideal, as it makes things more difficult to tune if the batch size is modified. However, I won't modify the code in this repo for now, as I don't think this is a major issue and I want to keep it the same as the code that generated the results in the paper. A sketch of the fix I have in mind is below.
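For reference, a minimal sketch of that fix (variable names are illustrative, and the sign/weighting of the adversarial term is shown schematically rather than copied from the repo):

    import tensorflow as tf

    advantages = tf.ones([256])  # [batch]
    log_prob = tf.ones([256])    # [batch]

    per_sample_adv_loss = advantages * log_prob     # shape [256,]
    adv_loss = tf.reduce_mean(per_sample_adv_loss)  # scalar; the gradient scale is now
                                                    # independent of the batch size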

I hope this makes sense, and please let me know if I have gotten something wrong! Thanks again for your interest in this work :)

@typoverflow (Author)

Do you know if it is definitely a sum?

I'm not that familiar with TF, so I just tried the following snippet:

>>> with tf.GradientTape() as g:
...   x = tf.constant([1,2,3], dtype=tf.float32)
...   g.watch(x)
...   y1 = tf.reduce_sum(x)
...   y2 = tf.reduce_sum(x)
...   y3 = [y1, y2]
...
>>> grads = g.gradient(y3, x)
>>> tf.Session().run(grads)
array([2., 2., 2.], dtype=float32)

The returned gradient indicates a sum over the batch dimension: each element of y3 contributes a gradient of ones with respect to x, and the contributions are added together, giving [2., 2., 2.].

If it is the sum, then I guess the implication is that the scale of the gradients will depend on the batch size, so that if the batch size is changed, this might require a change in the learning rate to account for this.

Actually, both a change in the learning rate and a change in the weighting term for the adversarial loss versus the supervised loss would be required (see the sketch below). But as long as the batch size is kept the same (256) as the one that produced the numbers in the paper, I don't think this is a big problem.
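To spell out the coupling under one reading (my own back-of-the-envelope, assuming the supervised term is already a scalar and only the adversarial term is effectively summed over the batch; w is the adversarial weighting term and B the batch size):

$$L_\text{total} = L_\text{sup} + w \sum_{i=1}^{B} L^{(i)}_\text{adv} = L_\text{sup} + (wB)\,\bar{L}_\text{adv},$$

so changing B away from 256 rescales the effective adversarial weight by B/256 and inflates the gradient of that term by the same factor, which is why both w and the learning rate would need adjusting.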

@marc-rigter (Owner)

Ah cool, that makes sense. Thanks a lot for pointing this out and investigating this. Sorry for the slight error!
