feat: support for mcore optimizer (to enable MoE) #380
base: dev
Conversation
Force-pushed from 56163e1 to 4fd8963
The DPO dataset changes should stand on their own, but are needed to test the mcore optimizer changes for MoE. If the MoE issues take too long to resolve, I'll break this up.
Force-pushed from 9a48bf1 to b9cf184
Force-pushed from e3c9c88 to c18de8d
Force-pushed from b1bca36 to ccce0b3
Force-pushed from ccce0b3 to 7e93e37
Force-pushed from e897fd7 to eab96f2
Signed-off-by: Terry Kong <[email protected]>
moe test is all2all
Signed-off-by: Terry Kong <[email protected]>
other params
Signed-off-by: Terry Kong <[email protected]>
fix peft mixtral
Signed-off-by: Terry Kong <[email protected]>
dockerfile bump to be on dev
Signed-off-by: Terry Kong <[email protected]>
just take dockerfile on dev
Signed-off-by: Terry Kong <[email protected]>
Force-pushed from eab96f2 to 1c98a08
Signed-off-by: Terry Kong <[email protected]>
# from multiple simultaneous NCCL calls
ptl_model._optimizer._finish_bucket_grad_sync()
# Mcore DistOpt handles this, so we don't have to
if not ptl_model.use_mcore_dist_optim:
How do you feel about dropping support for the non-mcore dist optim? Are they equivalent to Apex now?
Yeah, I want to do that in a follow-up PR (it would help our build times immensely). This PR just adds the feature without breaking Apex.
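For reference, a minimal sketch of the gating pattern being discussed; `finish_grad_sync_if_needed` is a hypothetical helper, and the `ptl_model` attribute names simply mirror the diff above rather than verified NeMo-Aligner internals:

```python
# Minimal sketch, assuming the attribute names shown in the diff above.
# The Megatron-core distributed optimizer finishes its gradient bucket syncs
# itself, so the manual sync is only needed on the non-mcore (Apex) path.
def finish_grad_sync_if_needed(ptl_model):
    if getattr(ptl_model, "use_mcore_dist_optim", False):
        return  # mcore DistOpt already handled the bucket grad sync
    # finish the sync explicitly to avoid issues from multiple simultaneous NCCL calls
    ptl_model._optimizer._finish_bucket_grad_sync()
```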
@@ -150,7 +150,7 @@ def train_single_step(self, batch):
         grad_norm = grad_norm.item() if torch.is_tensor(grad_norm) else grad_norm
         lr = self.optimizer.param_groups[0]["lr"]

-        self.optimizer.step()
+        self.optimizer.step(closure=None)
Do we have to specify the closure now? I thought this was optional.
For the mcore dist opt it's required, so I just set it everywhere.
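To see why passing the argument unconditionally is harmless, here is a small self-contained check (illustrative only, not code from this PR): standard `torch.optim` optimizers already define `step(closure=None)`, so `closure=None` is a no-op for them while still satisfying an optimizer whose `step()` requires the closure parameter.

```python
import torch

# Sketch: torch.optim optimizers accept step(closure=None), so always passing the
# argument keeps the call compatible with optimizers (such as the mcore distributed
# optimizer discussed above) whose step() expects a closure parameter.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step(closure=None)  # behaves exactly like optimizer.step() here
optimizer.zero_grad()
```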
Signed-off-by: Terry Kong <[email protected]>
What does this PR do?
Rebase stack
Changelog
Usage
# Add a code snippet demonstrating how to use this
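A hypothetical illustration of how the Megatron-core distributed optimizer might be selected in a NeMo-style config; the `mcore_distributed_optim` name follows NeMo's `use_mcore_dist_optim` convention and is an assumption here, not something confirmed by this PR:

```python
from omegaconf import OmegaConf

# Hypothetical config override (assumption: NeMo enables the Megatron-core
# distributed optimizer when model.optim.name == "mcore_distributed_optim";
# check the repo's example configs for the exact key and value).
cfg = OmegaConf.create(
    {"model": {"optim": {"name": "mcore_distributed_optim", "lr": 1e-4}}}
)
print(OmegaConf.to_yaml(cfg))
```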
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation?
Additional Information