Thank you for your excellent work! I've been trying to reproduce your results using GPT-2-base, following the methods outlined in your paper. However, during training, I encountered NaN loss values, and I also noticed that the INT8 model converged more slowly compared to the float model. Could you please share the specific training configuration you used for the INT8 version of GPT-2? Your assistance would be greatly appreciated.
Thank you in advance!
I was able to get non-NaN results when pretraining GPT-2 124M, but with much higher loss than the authors reported.
Specifically, I got ~5.17 with INT8 Jetfire vs. ~2.84 with bf16 nanoGPT.
I used the same learning rate of 6e-4 that nanoGPT provides for GPT-2 124M.
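For reference, the schedule around that 6e-4 learning rate followed nanoGPT's `config/train_gpt2.py` defaults. The sketch below shows the warmup + cosine-decay schedule; everything except the 6e-4 peak LR (warmup length, decay horizon, minimum LR) is a nanoGPT default and may not match whatever the authors used for their INT8 run.

```python
import math

# nanoGPT-style GPT-2 124M schedule. Only the 6e-4 peak LR is the value
# discussed above; the other constants are nanoGPT defaults, not settings
# confirmed for the Jetfire INT8 configuration.
learning_rate = 6e-4      # peak learning rate
warmup_iters = 2000       # linear warmup steps
lr_decay_iters = 600000   # cosine decay horizon (equals max_iters)
min_lr = 6e-5             # floor, roughly learning_rate / 10

def get_lr(it: int) -> float:
    """Linear warmup followed by cosine decay down to min_lr."""
    if it < warmup_iters:                       # 1) linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                     # 2) past the decay horizon
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(0), get_lr(2000), get_lr(600000))  # 0.0, 6e-4, 6e-5
```

If the INT8 run needs a lower peak LR or a longer warmup to stay stable, that would be useful to know, since I reused the bf16 schedule as-is.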