
[FEA] hook-based support for distributed but shared parameter #243

Merged: 9 commits into NVIDIA:main on Jan 4, 2024

Conversation

@stadlmax (Collaborator) commented Nov 21, 2023

Modulus Pull Request

Description

#235 asks for support for distributed parameters which are shared across the model-parallel group. The initial idea was a wrapper class, but the hook-based approach introduced in this draft should be less intrusive and more flexible: shared weights are simply marked or unmarked as needed, which registers (or removes) a gradient hook that takes care of the necessary reduction of gradients.

Closes #235, but with a different implementation idea than originally proposed.
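For illustration, the mechanism can be sketched roughly as follows. This is a minimal, hypothetical example of the hook-based idea, not the PR's actual API; the name `mark_parameter_as_shared` and its signature are made up for this sketch:

```python
import torch
import torch.distributed as dist

def mark_parameter_as_shared(param: torch.nn.Parameter, group=None):
    """Hypothetical helper: register a gradient hook on a parameter that is
    replicated across a model-parallel group, so its gradient is reduced
    (summed) over that group during backward."""

    def _reduce_grad(grad: torch.Tensor) -> torch.Tensor:
        # Clone before reducing so the autograd-provided gradient is not
        # mutated in place, then sum contributions from all ranks.
        reduced = grad.clone()
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=group)
        return reduced

    # Keep the handle; "unmarking" the parameter is then handle.remove().
    return param.register_hook(_reduce_grad)
```

Unmarking a shared weight then amounts to calling `.remove()` on the returned handle, which detaches the gradient hook again.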

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@akshaysubr (Collaborator) left a comment


This PR looks good to me. Suggested some relatively minor changes.

The only other major comment I have is to also evaluate DDP reduction hooks against these tensor hooks, and whether that approach is preferable due to its explicit non-blocking communication routines.
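For reference, the DDP alternative mentioned here would be a communication hook registered via `DistributedDataParallel.register_comm_hook`, which operates on gradient buckets and returns a future from a non-blocking collective. A sketch following PyTorch's built-in all-reduce hook pattern (not code from this PR):

```python
import torch
import torch.distributed as dist

def allreduce_comm_hook(
    process_group: dist.ProcessGroup, bucket: dist.GradBucket
) -> torch.futures.Future[torch.Tensor]:
    # Average the bucket's flattened gradients with a non-blocking
    # all-reduce; DDP can overlap this communication with backprop.
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer().div_(group.size())
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])

# Registration (assuming ddp_model is a DistributedDataParallel instance):
# ddp_model.register_comm_hook(state=process_group, hook=allreduce_comm_hook)
```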

Review threads (all resolved) on:
  • modulus/distributed/utils.py (4 threads, 1 outdated)
  • test/distributed/test_utils.py (5 threads, 2 outdated)
@stadlmax (Collaborator, Author) commented Dec 8, 2023: /blossom-ci

@stadlmax changed the title from "[DRAFT] hook-based support for distributed but shared parameter" to "[FEA] hook-based support for distributed but shared parameter" on Dec 8, 2023
@stadlmax requested a review from mnabian on December 8, 2023, 16:59
@stadlmax marked this pull request as ready for review on December 8, 2023, 17:14
@akshaysubr (Collaborator) commented: /blossom-ci

@stadlmax added the labels "4 - In Review" and "distributed", and self-assigned this pull request on Dec 12, 2023
@akshaysubr (Collaborator) commented twice more: /blossom-ci

@NickGeneva added the "! - Release" label (PRs or Issues relating to a release) on Jan 2, 2024
@stadlmax (Collaborator, Author) commented Jan 2, 2024: /blossom-ci

@mnabian (Collaborator) commented Jan 2, 2024: /blossom-ci

@NickGeneva (Collaborator) commented: /blossom-ci

@mnabian (Collaborator) commented Jan 4, 2024: /blossom-ci

@stadlmax merged commit 3ccdcbf into NVIDIA:main on Jan 4, 2024. 1 check passed.
Labels: 4 - In Review (Currently Under Review), ! - Release (PRs or Issues relating to a release), distributed (Distributed and model parallel tools)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

🚀[FEA]: Add a wrapper class for shared tensors in model parallel implementations
4 participants