Questions about the tensor shapes #1
Comments
Hey, thanks for your interest in this project. It would be fantastic to have a PyTorch implementation of this!

I'm not sure whether, in practice, the subtraction makes any difference: I don't think I tested without it, as I simply implemented the normalisation the same way as the OpenAI Baselines.

It's unclear to me from the documentation how TensorFlow handles this vector/array of losses, i.e. whether the loss is reduced to a scalar by taking a sum or a mean. Do you know if it is definitely a sum? If it is the sum, then the implication is that the scale of the gradients will depend on the batch size, so changing the batch size might require a change in the learning rate to account for this. That is certainly not ideal, as it will make things more difficult to tune if the batch size is modified. However, I won't modify the code in this repo for now, as I don't think this is a major issue and I want to leave it the same as the code that generated the results in the paper. I hope this makes sense, and please let me know if I have got something wrong! Thanks again for your interest in this work :)
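For reference, here is a minimal sketch (not from the repo, written in eager/TF2 style with a toy scalar weight `w`) of why the choice of reduction matters: with `reduce_sum` the gradient with respect to a shared parameter grows linearly with the batch size, while with `reduce_mean` it does not.

```python
import tensorflow as tf

def grad_wrt_weight(batch_size, reduce_fn):
    # Toy "model": a single scalar weight multiplied into every example.
    w = tf.Variable(1.0)
    x = tf.ones([batch_size])
    with tf.GradientTape() as tape:
        per_example_loss = w * x            # shape [batch_size]
        loss = reduce_fn(per_example_loss)  # reduced to a scalar
    return tape.gradient(loss, w)

print(float(grad_wrt_weight(256, tf.reduce_sum)))   # 256.0 -> scales with batch size
print(float(grad_wrt_weight(512, tf.reduce_sum)))   # 512.0
print(float(grad_wrt_weight(256, tf.reduce_mean)))  # 1.0   -> independent of batch size
print(float(grad_wrt_weight(512, tf.reduce_mean)))  # 1.0
```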
I'm not so familiar with TF, so I just tried the following snippet:

```python
>>> with tf.GradientTape() as g:
...     x = tf.constant([1, 2, 3], dtype=tf.float32)
...     g.watch(x)
...     y1 = tf.reduce_sum(x)
...     y2 = tf.reduce_sum(x)
...     y3 = [y1, y2]
...
>>> grads = g.gradient(y3, x)
>>> tf.Session().run(grads)
array([2., 2., 2.], dtype=float32)
```

The returned gradient indicates a sum over the batch dimension.
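For anyone porting this check to PyTorch, here is a rough equivalent (a sketch, not repo code): the gradient of a list of scalar losses is the sum of their gradients, and backpropagating a non-scalar loss vector with ones as `grad_outputs` is the same implicit sum.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y1 = x.sum()
y2 = x.sum()

# Gradient of a list of scalar losses: PyTorch sums their gradients,
# matching the tf.GradientTape result above.
(g,) = torch.autograd.grad((y1, y2), x)
print(g)  # tensor([2., 2., 2.])

# Equivalently, a non-scalar loss vector backpropagated with ones as
# grad_outputs is an implicit sum over that dimension.
x2 = torch.tensor([1., 2., 3.], requires_grad=True)
losses = torch.stack([x2.sum(), x2.sum()])
(g2,) = torch.autograd.grad(losses, x2, grad_outputs=torch.ones_like(losses))
print(g2)  # tensor([2., 2., 2.])
```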
Actually, both a change in the learning rate and a change in the weighting term for the adversarial loss versus the supervised learning loss would be required. But as long as the batch size is kept the same (256) as the one that produced the numbers in the paper, I don't think this is a big problem.
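A toy illustration of this point (stand-in losses, not the repo's actual ones): if one term is mean-reduced over the batch and the other is summed over it, their gradient magnitudes diverge as the batch size changes, so the effective relative weight (and the sensible learning rate) changes with it.

```python
import tensorflow as tf

def loss_gradients(batch_size):
    # Toy stand-ins for the two terms: the "supervised" part is mean-reduced
    # over the batch, the "adversarial" part is summed over it, mirroring the
    # reductions discussed in this issue.
    w = tf.Variable(1.0)
    with tf.GradientTape(persistent=True) as tape:
        per_example = w * tf.ones([batch_size])
        supervised = tf.reduce_mean(per_example)
        adversarial = tf.reduce_sum(per_example)
    return tape.gradient(supervised, w), tape.gradient(adversarial, w)

for n in (256, 512):
    g_sup, g_adv = loss_gradients(n)
    print(n, float(g_sup), float(g_adv))
# 256 1.0 256.0
# 512 1.0 512.0
# Doubling the batch doubles the gradient of the summed term but not the
# mean-reduced one, so the adversarial weighting effectively doubles.
```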
Ah cool, that makes sense. Thanks a lot for pointing this out and investigating it. Sorry for the slight error!
Hi,
thanks for sharing the code! I have been trying to migrate the code to PyTorch, but after some reading, a couple of questions came up:
The advantage is normalised with the batch mean and std. I'm wondering whether subtracting the batch mean here might cause trouble for the estimation of over-estimation, since it can change the sign of the advantages within that batch.
rambo/rambo/algorithms/rambo.py, line 966 (at commit 0f46d46)
The shape of the loss is somewhat strange to me. If I understand it correctly, the supervised loss

```python
supervised_loss = self._model.train_loss
```

is summed over `dim=0` (the ensemble dimension) and averaged over `dim=1,2`, but the adversarial loss is summed over the batch dimension, since in line 967 both the `advantages` and the `log_prob` have shape `[256,]`. This is strange to me because we normally don't sum over the batch dimension. Please correct me if I have misunderstood anything.