
[FEA] hook-based support for distributed but shared parameter #243

Merged: 9 commits into NVIDIA:main on Jan 4, 2024

Conversation

@stadlmax (Collaborator) commented Nov 21, 2023

Modulus Pull Request

Description

#235 asks for support for distributed parameters which are shared across the model-parallel group. The initial idea was a wrapper class, but the hook-based approach introduced in this draft should be less intrusive and more flexible: shared weights are simply marked or unmarked as needed, which registers (or removes) a gradient hook that takes care of the necessary reduction of gradients.

Closes #235, but with a different implementation idea than originally proposed.
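For illustration, the mechanism can be sketched roughly as follows. This is a minimal, hypothetical example of the hook-based idea, not the PR's actual API; the name `mark_parameter_as_shared` and its signature are made up for this sketch:

```python
import torch
import torch.distributed as dist

def mark_parameter_as_shared(param: torch.nn.Parameter, group=None):
    """Hypothetical helper: register a gradient hook on a parameter that is
    replicated across a model-parallel group, so its gradient is reduced
    (summed) over that group during backward."""

    def _reduce_grad(grad: torch.Tensor) -> torch.Tensor:
        # Clone before reducing so the autograd-provided gradient is not
        # mutated in place, then sum contributions from all ranks.
        reduced = grad.clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=group)
        return reduced

    # Keep the handle; "unmarking" the parameter is then handle.remove().
    return param.register_hook(_reduce_grad)
```

Unmarking a shared weight then amounts to calling `.remove()` on the returned handle, which detaches the gradient hook again.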

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@akshaysubr (Collaborator) left a comment


This PR looks good to me. Suggested some relatively minor changes.

The only other major comment I have is to also evaluate DDP reduction hooks against these tensor hooks, and whether that approach is preferable due to its explicit non-blocking communication routines.
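For reference, the DDP alternative mentioned here would be a communication hook registered via `DistributedDataParallel.register_comm_hook`, which operates on gradient buckets and returns a future from a non-blocking collective. A sketch following PyTorch's built-in all-reduce hook pattern (not code from this PR):

```python
import torch
import torch.distributed as dist

def allreduce_comm_hook(
    process_group: dist.ProcessGroup, bucket: dist.GradBucket
) -> torch.futures.Future[torch.Tensor]:
    # Average the bucket's flattened gradients with a non-blocking
    # all-reduce; DDP can overlap this communication with backprop.
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer().div_(group.size())
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])

# Registration (assuming ddp_model is a DistributedDataParallel instance):
# ddp_model.register_comm_hook(state=process_group, hook=allreduce_comm_hook)
```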

Review threads (all resolved) on:
  • modulus/distributed/utils.py (4 threads, 1 outdated)
  • test/distributed/test_utils.py (5 threads, 2 outdated)
@stadlmax (Collaborator, Author) commented Dec 8, 2023: /blossom-ci

@stadlmax changed the title from "[DRAFT] hook-based support for distributed but shared parameter" to "[FEA] hook-based support for distributed but shared parameter" on Dec 8, 2023
@stadlmax requested a review from mnabian on December 8, 2023, 16:59
@stadlmax marked this pull request as ready for review on December 8, 2023, 17:14
@akshaysubr (Collaborator) commented: /blossom-ci

@stadlmax added the labels "4 - In Review" and "distributed", and self-assigned this pull request on Dec 12, 2023
@akshaysubr (Collaborator) commented twice more: /blossom-ci

@NickGeneva added the "! - Release" label (PRs or Issues relating to a release) on Jan 2, 2024
@stadlmax (Collaborator, Author) commented Jan 2, 2024: /blossom-ci

@mnabian (Collaborator) commented Jan 2, 2024: /blossom-ci

@NickGeneva (Collaborator) commented: /blossom-ci

@mnabian (Collaborator) commented Jan 4, 2024: /blossom-ci

@stadlmax merged commit 3ccdcbf into NVIDIA:main on Jan 4, 2024. 1 check passed.
Labels: 4 - In Review (Currently Under Review), ! - Release (PRs or Issues relating to a release), distributed (Distributed and model parallel tools)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

🚀[FEA]: Add a wrapper class for shared tensors in model parallel implementations
4 participants