Enhanced gemma prediction with new flawless logit #51
PiperOrigin-RevId: 663277444 Change-Id: I8d7030ce586577a433c48f32df7efa7c141b171a
…ormer_lib.make_causal_attn_mask(input_mask)` PiperOrigin-RevId: 663692225 Change-Id: Ie2cb6229302087ea1ce5b5c7f442a088207ead07
PiperOrigin-RevId: 665414923 Change-Id: I42bc41074518e3065f85c7f1a3014fdd09cffe4c
Currently all weights in FeedForward layers are initialized to zero. This doesn't cause any issues when loading the module with pretrained weights, but when training from scratch it results in all gradients being zero throughout training, so no learning can occur. Changing `w_gating` to be initialized from a normal distribution fixes this. PiperOrigin-RevId: 674306730 Change-Id: I90800dbe605cdf88f341d103f102357ff278a393
PiperOrigin-RevId: 674394389 Change-Id: I25ba5ad4769c3101c2bf572e33723d4a241e3895
…se errors for implicit rank promotion. PiperOrigin-RevId: 675179053 Change-Id: I55459c1aa99c7d33ae3f03712eaed01ccc5fc9f2
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
The GitHub CLA check doesn't recognize the noreply user @a-googler <no****ly@google.com>. How shall I proceed? Should I use an interactive rebase to edit the author of the related commits?
Integration of a `flawless_logit`.
Normalizing the logits for each token is intended to keep the predictions more balanced and less likely to be dominated by any single token.
Subtracting the normalized sum can help reduce biases and make the logits more representative of the actual distribution of the data.
Initial tests on Gemma2 7B suggest that performance at inference time has improved.
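
The description does not spell out the exact formula, so the following is only a minimal sketch of one possible reading: "the normalized sum" is taken to be the sum of a token's logits divided by the vocabulary size (the mean), and it is subtracted from every logit at that position. The names `center_logits` and `softmax` are hypothetical helpers written for illustration; they are not taken from the Gemma codebase or from this PR.

```python
import math


def center_logits(logits):
    """Subtract the mean logit from each entry of one token's logit vector.

    ASSUMPTION: "normalized sum" is read as sum(logits) / vocab_size.
    The JAX equivalent over a batch would be
    logits - jnp.mean(logits, axis=-1, keepdims=True).
    """
    mean = sum(logits) / len(logits)
    return [x - mean for x in logits]


def softmax(logits):
    """Numerically stable softmax over one token's logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for a single position over a 4-entry vocabulary.
logits = [2.0, 0.5, -1.0, 4.5]
centered = center_logits(logits)

print(centered)            # centered logits sum to zero
print(softmax(logits))     # probabilities before centering
print(softmax(centered))   # probabilities after centering
```

Under this particular reading, the shift is the same for every vocabulary entry at a given position, so the softmax probabilities and the argmax are unchanged. Any change in sampling behavior would therefore have to come from a step that differs from this sketch.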