[Core] Support fully transparent sleep mode #11743
Conversation
Thanks to @cennn, who helped me a lot along the way.
This PR benefits a lot from pytorch/pytorch#131152 and pytorch/pytorch#124807.
TODO: in distributed inference, there is also NCCL memory to consider. We need to check how much memory it takes, and whether we need to do anything to release that part (it might be quite difficult, as NCCL is quite a black box).
Makes sense. I will provide the reset capability for prefix caching. In the meantime, please raise an error in this PR if sleep mode level 2 is used while prefix caching is enabled, so that it can be unblocked.
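A minimal sketch of the kind of guard being requested; the function and argument names here are illustrative, not the PR's actual identifiers:

```python
def check_sleep_level(level: int, enable_prefix_caching: bool) -> None:
    """Hypothetical validation: level-2 sleep discards the KV cache, so it
    cannot coexist with prefix caching until the prefix-cache state can be
    reset."""
    if level == 2 and enable_prefix_caching:
        raise ValueError(
            "Sleep mode level 2 discards the KV cache and is currently "
            "incompatible with prefix caching; disable prefix caching or "
            "use sleep level 1."
        )
```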
Otherwise LGTM. Approving to unblock first, but others' comments are welcome.
LGTM from skimming through the PR. Since this PR is quite isolated, I don't have concerns about merging it.
I will take a closer look in a few days once I have more bandwidth.
There is a strong need from the RLHF community to put vLLM into sleep mode (offload weights, discard the KV cache); see #10714 and #11638.
I have implemented the core functionality in https://github.com/vllm-project/vllm_allocator_adaptor, and added the integration in this PR.
Currently, we support two sleep levels: level 1 offloads the model weights to CPU and discards the KV cache; level 2 discards both the weights and the KV cache.
NOTE: we do nothing for cudagraph memory right now.
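As a rough illustration of what level-1 sleep boils down to, here is a conceptual sketch in plain PyTorch; it is not the vllm_allocator_adaptor API, and unlike this sketch the real implementation keeps the virtual addresses stable via the cumem API, so pointers captured in cudagraphs remain valid:

```python
import torch

# Conceptual sketch of level-1 sleep: stash a GPU buffer in pinned host
# memory, release the device memory, and later copy the bytes back.
weight = torch.randn(4096, 4096, device="cuda")

# sleep: back up to pinned CPU memory and drop the device copy
cpu_backup = weight.cpu().pin_memory()
del weight
torch.cuda.empty_cache()

# ... the freed GPU memory can now serve another workload (e.g. RLHF training) ...

# wake up: move the bytes back to the device
weight = cpu_backup.to("cuda", non_blocking=True)
torch.cuda.synchronize()
```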
Because the underlying cumem API is very low-level, this PR is also compatible with cudagraph. With this PR, when sleep mode is enabled:
the current vLLM instance can use total_gpu_memory (79.22GiB) x gpu_memory_utilization (0.90) = 71.29GiB
model weights take 2.32GiB; non_torch_memory takes 0.67GiB; PyTorch activation peak memory takes 1.20GiB; the rest of the memory reserved for KV Cache is 67.11GiB.
Free memory before sleep: 8.34 GiB
Free memory after sleep: 78.29 GiB
(about 70 GiB released, which is the sum of the model weights and the KV cache size)
Why is the free memory after sleep not 80 GiB? Because the cudagraph memory pool is not released, and there are some other costs such as the CUDA context.
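For reference, a minimal usage sketch of the user-facing flow; the enable_sleep_mode flag, sleep(level=...), and wake_up() names follow vLLM's sleep-mode API, and the model name is just a placeholder:

```python
from vllm import LLM

# Sleep mode must be enabled at construction time so the custom allocator
# can track the weight and KV-cache allocations.
llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)

print(llm.generate("The capital of France is"))

# Level 1: offload weights to CPU and discard the KV cache, freeing most of
# the GPU for another workload (e.g. an RLHF training step).
llm.sleep(level=1)

# ... run the other workload on the freed GPU memory ...

# Bring the weights back to the GPU and rebuild the KV cache before serving.
llm.wake_up()
print(llm.generate("The capital of Germany is"))
```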
TODO: