[Core] Support fully transparent sleep mode #11743
Conversation
Thanks to @cennn, who helped me a lot along the way.
This PR benefits a lot from pytorch/pytorch#131152 and pytorch/pytorch#124807.
TODO: in distributed inference, there is also NCCL memory to consider. We need to check how much memory it takes, and whether we need to do anything to release that part (it might be quite difficult, as NCCL is quite a black box).
Makes sense. I will provide the reset capability for prefix caching. In the meantime, please raise an error in this PR if sleep mode level 2 is used while prefix caching is enabled, so that it can be unblocked.
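A minimal sketch of the kind of guard being requested; the function and argument names here are illustrative, not the PR's actual identifiers:

```python
def check_sleep_level(level: int, enable_prefix_caching: bool) -> None:
    """Hypothetical validation: level-2 sleep discards the KV cache, so it
    cannot coexist with prefix caching until the prefix-cache state can be
    reset."""
    if level == 2 and enable_prefix_caching:
        raise ValueError(
            "Sleep mode level 2 discards the KV cache and is currently "
            "incompatible with prefix caching; disable prefix caching or "
            "use sleep level 1."
        )
```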
Otherwise LGTM. Approving to unblock first, but others' comments are welcome.
LGTM from skimming through the PR. Since this PR is quite isolated, I don't have concerns about merging it.
I will take a closer look in a few days once I have more bandwidth.
There is a strong need from the RLHF community to put vLLM into sleep mode (offload weights, discard the KV cache); see #10714 and #11638.
I have implemented the core functionality in https://github.com/vllm-project/vllm_allocator_adaptor, and added the integration in this PR.
Currently, we support two sleep levels: level 1 offloads the model weights to CPU and discards the KV cache; level 2 discards both the weights and the KV cache.
NOTE: we do nothing for cudagraph memory right now.
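As a rough illustration of what level-1 sleep boils down to, here is a conceptual sketch in plain PyTorch; it is not the vllm_allocator_adaptor API, and unlike this sketch the real implementation keeps the virtual addresses stable via the cumem API, so pointers captured in cudagraphs remain valid:

```python
import torch

# Conceptual sketch of level-1 sleep: stash a GPU buffer in pinned host
# memory, release the device memory, and later copy the bytes back.
weight = torch.randn(4096, 4096, device="cuda")

# sleep: back up to pinned CPU memory and drop the device copy
cpu_backup = weight.cpu().pin_memory()
del weight
torch.cuda.empty_cache()

# ... the freed GPU memory can now serve another workload (e.g. RLHF training) ...

# wake up: move the bytes back to the device
weight = cpu_backup.to("cuda", non_blocking=True)
torch.cuda.synchronize()
```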
Because the underlying cumem API is very low-level, this PR is also compatible with cudagraph. With this PR, when sleep mode is enabled:
the current vLLM instance can use total_gpu_memory (79.22GiB) x gpu_memory_utilization (0.90) = 71.29GiB
model weights take 2.32GiB; non_torch_memory takes 0.67GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 67.11GiB.
Free memory before sleep: 8.34 GiB
Free memory after sleep: 78.29 GiB
(about 70 GiB released, which is the sum of the model weights and the KV cache size)
Why is the free memory after sleep not 80 GiB? Because the cudagraph memory pool is not released, and there are some other costs such as the CUDA context.
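For reference, a minimal usage sketch of the user-facing flow; the enable_sleep_mode flag, sleep(level=...), and wake_up() names follow vLLM's sleep-mode API, and the model name is just a placeholder:

```python
from vllm import LLM

# Sleep mode must be enabled at construction time so the custom allocator
# can track the weight and KV-cache allocations.
llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)

print(llm.generate("The capital of France is"))

# Level 1: offload weights to CPU and discard the KV cache, freeing most of
# the GPU for another workload (e.g. an RLHF training step).
llm.sleep(level=1)

# ... run the other workload on the freed GPU memory ...

# Bring the weights back to the GPU and rebuild the KV cache before serving.
llm.wake_up()
print(llm.generate("The capital of Germany is"))
```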
TODO: