feat: support for mcore optimizer (to enable MoE) #380
base: dev
Conversation
Force-pushed from 56163e1 to 4fd8963
The DPO dataset changes should stand on their own, but are needed to test the mcore optimizer changes for MoE. If the MoE issues take too long to resolve, I'll break this up.
Force-pushed from 9a48bf1 to b9cf184
Force-pushed from e3c9c88 to c18de8d
Force-pushed from b1bca36 to ccce0b3
Force-pushed from ccce0b3 to 7e93e37
Force-pushed from e897fd7 to eab96f2
Signed-off-by: Terry Kong <[email protected]>
moe test is all2all
Signed-off-by: Terry Kong <[email protected]>
other params
Signed-off-by: Terry Kong <[email protected]>
fix peft mixtral
Signed-off-by: Terry Kong <[email protected]>
dockerfile bump to be on dev
Signed-off-by: Terry Kong <[email protected]>
just take dockerfile on dev
Signed-off-by: Terry Kong <[email protected]>
Force-pushed from eab96f2 to 1c98a08
Signed-off-by: Terry Kong <[email protected]>
# from multiple simultaneous NCCL calls
ptl_model._optimizer._finish_bucket_grad_sync()
# Mcore DistOpt handles this, so we don't have to
if not ptl_model.use_mcore_dist_optim:
How do you feel about dropping support for the non-mcore dist optim? Are they equivalent to Apex now?
Yeah, I want to do that in a follow-up PR (it would help our build times immensely). This PR just adds the feature without breaking Apex.
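For reference, a minimal sketch of the gating pattern being discussed; `finish_grad_sync_if_needed` is a hypothetical helper, and the `ptl_model` attribute names simply mirror the diff above rather than verified NeMo-Aligner internals:

```python
# Minimal sketch, assuming the attribute names shown in the diff above.
# The Megatron-core distributed optimizer finishes its gradient bucket syncs
# itself, so the manual sync is only needed on the non-mcore (Apex) path.
def finish_grad_sync_if_needed(ptl_model):
    if getattr(ptl_model, "use_mcore_dist_optim", False):
        return  # mcore DistOpt already handled the bucket grad sync
    # finish the sync explicitly to avoid issues from multiple simultaneous NCCL calls
    ptl_model._optimizer._finish_bucket_grad_sync()
```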
@@ -150,7 +150,7 @@ def train_single_step(self, batch):
         grad_norm = grad_norm.item() if torch.is_tensor(grad_norm) else grad_norm
         lr = self.optimizer.param_groups[0]["lr"]

-        self.optimizer.step()
+        self.optimizer.step(closure=None)
Do we have to specify the closure now? I thought this was optional.
For the mcore dist opt it's required, so I just set it everywhere.
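To see why passing the argument unconditionally is harmless, here is a small self-contained check (illustrative only, not code from this PR): standard `torch.optim` optimizers already define `step(closure=None)`, so `closure=None` is a no-op for them while still satisfying an optimizer whose `step()` requires the closure parameter.

```python
import torch

# Sketch: torch.optim optimizers accept step(closure=None), so always passing the
# argument keeps the call compatible with optimizers (such as the mcore distributed
# optimizer discussed above) whose step() expects a closure parameter.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step(closure=None)  # behaves exactly like optimizer.step() here
optimizer.zero_grad()
```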
Signed-off-by: Terry Kong <[email protected]>
What does this PR do?
Rebase stack
Changelog
Usage
# Add a code snippet demonstrating how to use this
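A hypothetical illustration of how the Megatron-core distributed optimizer might be selected in a NeMo-style config; the `mcore_distributed_optim` name follows NeMo's `use_mcore_dist_optim` convention and is an assumption here, not something confirmed by this PR:

```python
from omegaconf import OmegaConf

# Hypothetical config override (assumption: NeMo enables the Megatron-core
# distributed optimizer when model.optim.name == "mcore_distributed_optim";
# check the repo's example configs for the exact key and value).
cfg = OmegaConf.create(
    {"model": {"optim": {"name": "mcore_distributed_optim", "lr": 1e-4}}}
)
print(OmegaConf.to_yaml(cfg))
```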
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation?
Additional Information