[Bugfix] Fix fully sharded LoRAs with Mixtral #11390
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Thanks for your fix. Currently Mixtral is not being tested on CI. I assume you have already tested it locally.
Thanks for your review @jeejeelee! I ran the LoRA Mixtral tests locally (before and after the change), and also tested interaction with a booted vLLM instance using Mixtral with LoRA adapters. I can look for and run additional Mixtral tests in the vLLM test suite locally if those aren't always run.
@jeejeelee These all passed too (I ran the other Mixtral tests outside of tests/lora, in addition to the ones in tests/lora). It looks like the reported CI failures in entrypoints-test are also failing on main, so those are unrelated.
- Changes ReplicatedLinearWithLoRA to always apply regardless of the fully sharded LoRA setting, since in both cases the layer needs to be replicated
- Updates the existing Mixtral all-modules test to cover both values of fully_sharded_loras (which includes a ReplicatedLinear layer [gate])

Signed-off-by: Jason Greene <[email protected]>
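For illustration, here is a minimal sketch of the kind of parametrization the commit describes. The model path, adapter path, prompt, and the fixture-free layout are placeholders, not the actual code in tests/lora:

```python
# Minimal sketch only -- the real test in tests/lora uses its own fixtures;
# MODEL_PATH, LORA_PATH, and the prompt below are placeholders.
import pytest
import vllm
from vllm.lora.request import LoRARequest

MODEL_PATH = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder
LORA_PATH = "/path/to/mixtral-lora-adapter"          # placeholder


@pytest.mark.parametrize("fully_shard", [True, False])
def test_mixtral_lora_all_modules(fully_shard):
    # Both settings should now work, since ReplicatedLinearWithLoRA always
    # replaces the MoE gate's ReplicatedLinear layer.
    llm = vllm.LLM(
        MODEL_PATH,
        enable_lora=True,
        max_loras=4,
        fully_sharded_loras=fully_shard,
        tensor_parallel_size=4,
    )
    outputs = llm.generate(
        ["What is the capital of France?"],
        lora_request=LoRARequest("test-adapter", 1, LORA_PATH),
    )
    assert outputs and outputs[0].outputs[0].text
```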
Force-pushed from d08cbf3 to b13f3b9
Fixes a regression introduced by #9008, which leads to an assertion error when --fully-sharded-loras is enabled with an adapter that includes a gate target. This occurs because Mixtral includes a ReplicatedLinear layer for the MoE gate, and #7081 marked it as @_not_fully_sharded_can_replace. Since these layers are per-GPU and the implementation looks safe, I assume this was just an unintentional copy of the decorator. This PR removes it so that ReplicatedLinearWithLoRA will replace ReplicatedLinear regardless of whether fully sharded LoRAs are enabled, and updates the test scenario to cover both values.
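For context, roughly the shape of the change (a simplified sketch, not the exact diff in vllm/lora/layers.py; the class's other LoRA methods are omitted):

```python
# Simplified sketch: with @_not_fully_sharded_can_replace removed,
# can_replace_layer no longer returns False when fully sharded LoRAs are
# enabled, so the MoE gate's ReplicatedLinear is always wrapped.
from typing import Optional

import torch.nn as nn
from transformers import PretrainedConfig

from vllm.config import LoRAConfig
from vllm.lora.layers import BaseLayerWithLoRA
from vllm.model_executor.layers.linear import ReplicatedLinear


class ReplicatedLinearWithLoRA(BaseLayerWithLoRA):
    # ... weight creation / forward omitted for brevity ...

    @classmethod
    def can_replace_layer(
        cls,
        source_layer: nn.Module,
        lora_config: LoRAConfig,
        packed_modules_list: list,
        model_config: Optional[PretrainedConfig],
    ) -> bool:
        # Replication is per-GPU, so this layer applies whether or not
        # --fully-sharded-loras is enabled.
        return type(source_layer) is ReplicatedLinear
```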
Let me know if I am missing anything.
Thanks!