I've been trying to figure out attention_entropy, so please let me know if I've misunderstood.
The attention scores cover the whole possible sequence length (shape: `(1, 32, 1, 4096)`), with future tokens getting a score of 0:
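This isn't the repo's exact `calculate_metrics()` code, just a minimal JAX sketch of the effect I mean: with future scores left at 0, the softmax spreads probability mass almost uniformly over the full 4096-length axis, so entropy comes out near its maximum and varentropy near zero.

```python
import jax
import jax.numpy as jnp

seqlen, cur_pos = 4096, 10
scores = jnp.zeros((1, 32, 1, seqlen))
# only the first cur_pos positions carry real scores; future positions stay at 0
scores = scores.at[..., :cur_pos].set(
    jax.random.normal(jax.random.PRNGKey(0), (1, 32, 1, cur_pos))
)

probs = jax.nn.softmax(scores, axis=-1)  # exp(0) = 1, so future positions still get mass
log_probs = jnp.log2(jnp.clip(probs, 1e-10, 1.0))
entropy = -jnp.sum(probs * log_probs, axis=-1)                                # ~log2(4096) = 12 bits
varentropy = jnp.sum(probs * (log_probs + entropy[..., None]) ** 2, axis=-1)  # ~0 (near-uniform)
print(entropy.mean(), varentropy.mean())
```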
This seems to inflate attention_entropy and make attention_varentropy really low. Intuitively, as output sequences approach the maximum sequence length, attention_entropy will collapse sharply (since there are fewer 0s left in the scores) and attention_varentropy will increase. It also means that attention_probs for future tokens are non-zero in `calculate_metrics()` after the softmax. As a result, the thresholds set in the frog branch `sampler.py` may become less well calibrated as sequences get longer.

In testing (in the jax ipynb), I've found that masking the future attention scores with your default mask value keeps attention_entropy and attention_varentropy slightly more stable (if not increasing, which is what you'd expect when there are more tokens to attend to) at higher sequence lengths.
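For reference, what I tested was roughly along these lines (a sketch, not a patch against the repo: `masked_attention_metrics` and the mask constant are illustrative, and the repo's own default mask value / mask tensor would be used instead):

```python
import jax
import jax.numpy as jnp

# Assumption: the mask value is a large negative float along the lines of the
# repo's default; the exact constant doesn't matter, only that exp() underflows to 0.
DEFAULT_MASK_VALUE = -0.7 * float(jnp.finfo(jnp.float32).max)

def masked_attention_metrics(scores: jax.Array, cur_pos: int):
    """Entropy/varentropy over attention scores with positions >= cur_pos masked out."""
    seqlen = scores.shape[-1]
    future = jnp.arange(seqlen) >= cur_pos                  # True for not-yet-generated positions
    masked = jnp.where(future, DEFAULT_MASK_VALUE, scores)  # broadcasts over (bsz, heads, 1, seqlen)
    probs = jax.nn.softmax(masked, axis=-1)                 # future positions get ~0 probability
    log_probs = jnp.log2(jnp.clip(probs, 1e-10, 1.0))
    entropy = -jnp.sum(probs * log_probs, axis=-1)
    varentropy = jnp.sum(probs * (log_probs + entropy[..., None]) ** 2, axis=-1)
    return entropy, varentropy
```

With this, entropy only ranges over the positions that have actually been attended to, so it grows gradually with sequence length instead of starting near the maximum and collapsing.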