[torch.compile] allow candidate compile sizes #10984
Conversation
 assert args.compilation_config.level == 3

-# set to json
+# set to string form of a dict
 args = parser.parse_args(['--compilation-config={"level": 3}'])
Previously this was a JSON string, but I find JSON strings too restrictive: e.g., we must use double quotes, we cannot have a trailing comma, etc.
Therefore, here we switch to the string form of a Python dict.
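For context, a minimal sketch (not the parsing code in this PR) of why the string form of a Python dict is more forgiving on the command line than strict JSON:

# Illustrative only: compares strict JSON parsing with Python literal parsing.
import ast
import json

value = "{'level': 3, 'candidate_compile_sizes': [1, 2, 4],}"  # single quotes, trailing comma

try:
    json.loads(value)  # JSON requires double quotes and forbids trailing commas
except json.JSONDecodeError as exc:
    print(f"json.loads rejects it: {exc}")

config = ast.literal_eval(value)  # Python literal syntax accepts both
print(config)  # {'level': 3, 'candidate_compile_sizes': [1, 2, 4]}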
With this PR, we finally get a throughput improvement:

$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 8
Throughput: 42.91 requests/s, 21972.32 total tokens/s, 10986.16 output tokens/s
init engine (profile, create kv cache, warmup model) took 14.49 seconds
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 16
Throughput: 43.78 requests/s, 22417.37 total tokens/s, 11208.69 output tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 32
Throughput: 44.31 requests/s, 22688.84 total tokens/s, 11344.42 output tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 64
Throughput: 44.74 requests/s, 22905.20 total tokens/s, 11452.60 output tokens/s
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 128
Throughput: 44.43 requests/s, 22747.24 total tokens/s, 11373.62 output tokens/s
# The best --num-scheduler-steps is 64; now compare it against the same setting with torch.compile:
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 64 -O "{'level': 3, 'candidate_compile_sizes': [232, 256]}"
Throughput: 46.13 requests/s, 23617.97 total tokens/s, 11808.98 output tokens/s
init engine (profile, create kv cache, warmup model) took 88.60 seconds

The throughput improvement: 44.74 requests/s --> 46.13 requests/s. Looking only at the decode (output) throughput: 11452.60 output tokens/s --> 11808.98 output tokens/s.

Note: the key here is to use multi-step scheduling, so that the model execution loop stays busy and the benefit of faster model execution can show up. If we benchmark without multi-step scheduling, the scheduling overhead shadows the benefit of faster model execution.
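In relative terms, 46.13 / 44.74 ≈ 1.031 and 11808.98 / 11452.60 ≈ 1.031, i.e. roughly a 3% gain in both end-to-end and decode throughput.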
LGTM!
Usage 1:
vllm serve meta-llama/Meta-Llama-3-8B -O 3
No compilation for specific sizes.

Usage 2:
vllm serve meta-llama/Meta-Llama-3-8B -O "{'level': 3, 'candidate_compile_sizes': [1, 2, 4]}"
Compile for sizes [1, 2, 4].

Usage 3:
vllm serve meta-llama/Meta-Llama-3-8B -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 5)]}"
Compile for sizes [1, 2, 4]. The shell expression $(seq -s, 1 5) expands to 1, 2, 3, 4, 5; 3 and 5 are then removed because they are not cudagraph sizes. This allows users to easily specify compile sizes no larger than 5. (See the sketch below for the filtering behavior.)
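A rough sketch of that filtering, assuming candidate sizes are simply intersected with the cudagraph capture sizes (the helper name filter_compile_sizes is hypothetical, not the actual vLLM implementation):

# Keep only candidate compile sizes that are also cudagraph capture sizes.
def filter_compile_sizes(candidate_compile_sizes, cudagraph_capture_sizes):
    capture = set(cudagraph_capture_sizes)
    return sorted(s for s in set(candidate_compile_sizes) if s in capture)

# $(seq -s, 1 5) gives [1, 2, 3, 4, 5]; with capture sizes [1, 2, 4, 8, ...],
# 3 and 5 are dropped, leaving [1, 2, 4].
print(filter_compile_sizes([1, 2, 3, 4, 5], [1, 2, 4, 8, 16]))  # [1, 2, 4]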