[Misc] Kernel Benchmark for RMSNorm
#11241
Co-authored-by: Xiaoyu Zhang <[email protected]> Signed-off-by: Roger Wang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
LGTM!
Here are my results just running on H100:
python benchmark_rmsnorm.py
...
rmsnorm-perf-without-residual:
head_num batch_size seq_len HuggingFace FlashInfer vLLM
0 32.0 1.0 64.0 52.703999 9.792000 11.744000
1 32.0 1.0 128.0 46.208002 11.648000 13.824000
2 32.0 1.0 256.0 52.928001 12.032000 14.272000
3 32.0 1.0 512.0 64.736001 14.208000 18.784000
4 32.0 1.0 1024.0 91.807999 19.872000 27.584000
5 32.0 4.0 64.0 53.056002 13.120000 14.656000
6 32.0 4.0 128.0 65.920003 14.688000 19.200001
7 32.0 4.0 256.0 94.463997 19.904001 27.327999
8 32.0 4.0 512.0 184.064001 31.168001 46.176001
9 32.0 4.0 1024.0 333.792001 60.864002 97.152002
10 32.0 16.0 64.0 92.896000 19.680001 27.200000
11 32.0 16.0 128.0 183.904007 30.975999 45.791999
12 32.0 16.0 256.0 332.704008 60.864002 97.184002
13 32.0 16.0 512.0 618.336022 109.024003 179.296002
14 32.0 16.0 1024.0 1192.352057 205.791995 343.456000
15 32.0 64.0 64.0 333.472013 60.896002 97.280003
16 32.0 64.0 128.0 617.824018 109.024003 179.296002
17 32.0 64.0 256.0 1192.288041 205.888003 343.423992
18 32.0 64.0 512.0 2335.776091 399.295986 671.711981
19 32.0 64.0 1024.0 4625.023842 789.951980 1330.960035
20 48.0 1.0 64.0 48.608001 10.144000 12.704000
21 48.0 1.0 128.0 53.056002 11.744000 14.848000
22 48.0 1.0 256.0 62.463999 13.504000 16.303999
23 48.0 1.0 512.0 80.959998 17.503999 22.720000
24 48.0 1.0 1024.0 142.752007 26.208000 34.623999
25 48.0 4.0 64.0 62.431999 13.504000 16.272001
26 48.0 4.0 128.0 80.159999 17.535999 22.752000
27 48.0 4.0 256.0 143.360004 26.144000 34.527998
28 48.0 4.0 512.0 266.719997 52.703999 72.031997
29 48.0 4.0 1024.0 476.480007 93.631998 133.056000
30 48.0 16.0 64.0 142.848000 26.144000 34.623999
31 48.0 16.0 128.0 266.128004 52.687999 72.095998
32 48.0 16.0 256.0 477.151990 93.567997 133.151993
33 48.0 16.0 512.0 904.640019 173.823997 249.696001
34 48.0 16.0 1024.0 1763.872027 334.975988 484.351993
35 48.0 64.0 64.0 477.344006 93.631998 133.248001
36 48.0 64.0 128.0 903.551996 173.840001 249.791995
37 48.0 64.0 256.0 1764.448047 334.895998 484.320015
38 48.0 64.0 512.0 3475.487947 658.688009 953.232050
39 48.0 64.0 1024.0 6898.752213 1307.775974 1889.744043
FlashInfer is faster than the vLLM kernel in every configuration here, from roughly 1.1x on small inputs to about 1.7x on the largest.
Needed to install flashinfer with: uv pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/
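To put a number on the gap, the ratio of vLLM to FlashInfer latency can be computed from a few rows of the output above. This is a minimal sketch with values copied from that output; the `rows` list and the `speedup` helper are illustrative names and not part of the benchmark script.

```python
# (head_num, batch_size, seq_len, HuggingFace, FlashInfer, vLLM),
# copied from the benchmark output above.
rows = [
    (32, 1, 64, 52.703999, 9.792000, 11.744000),
    (32, 4, 64, 53.056002, 13.120000, 14.656000),
    (32, 64, 1024, 4625.023842, 789.951980, 1330.960035),
    (48, 64, 1024, 6898.752213, 1307.775974, 1889.744043),
]


def speedup(baseline: float, candidate: float) -> float:
    """Return how many times faster `candidate` is than `baseline` (latency ratio)."""
    return baseline / candidate


for head_num, batch_size, seq_len, hf, fi, vllm in rows:
    print(
        f"heads={head_num} batch={batch_size} seq={seq_len}: "
        f"FlashInfer is {speedup(vllm, fi):.2f}x faster than vLLM, "
        f"{speedup(hf, fi):.2f}x faster than HuggingFace"
    )
```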
seq_len=128,
hidden_size=4096,
use_residual=args.use_residual)
Making these configurable through args would be perfect.
That's a good point! @jeejeelee
Added in 71af57f
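For readers following along, here is a minimal sketch of what exposing these values through `argparse` could look like. The flag names are assumptions for illustration; only `seq_len=128`, `hidden_size=4096`, and `args.use_residual` appear in the snippet under review, so see commit 71af57f for the actual change.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the benchmark shape options (hypothetical flag names)."""
    parser = argparse.ArgumentParser(
        description="RMSNorm benchmark options (illustrative sketch)"
    )
    # Defaults mirror the hard-coded values in the reviewed snippet.
    parser.add_argument("--seq-len", type=int, default=128)
    parser.add_argument("--hidden-size", type=int, default=4096)
    # Boolean switch: False unless the flag is passed.
    parser.add_argument("--use-residual", action="store_true")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--seq-len", "256", "--use-residual"])
print(args.seq_len, args.hidden_size, args.use_residual)
```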
This is very good to know. The RMSNorm kernel (and the RoPE kernel) are not optimized enough. We should replace them with either the FlashInfer kernel or a Triton kernel.
Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Xiaoyu Zhang <[email protected]>
This PR ports the RMSNorm kernel benchmark authored by @BBuf in sgl-project/sglang#2486 to the vLLM repo to compare the performance of our custom op against FlashInfer.
Co-authored-by: @BBuf
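For context on what is being measured, below is a dependency-free reference sketch of RMSNorm over a single vector, with an optional residual add to mirror the benchmark's with-residual and without-residual modes. It is plain Python for illustration only, not the vLLM, FlashInfer, or HuggingFace implementation; the function name, the return convention, and `eps=1e-6` are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple


def rms_norm_reference(
    x: Sequence[float],
    weight: Sequence[float],
    residual: Optional[Sequence[float]] = None,
    eps: float = 1e-6,  # assumed default, not taken from the PR
) -> Tuple[List[float], List[float]]:
    """Normalize `x` by its root mean square and scale by `weight`.

    If `residual` is given, it is added to `x` before normalization.
    Returns (normalized output, pre-normalization input after the add).
    """
    hidden = list(x)
    if residual is not None:
        hidden = [h + r for h, r in zip(hidden, residual)]
    mean_sq = sum(h * h for h in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    out = [h * inv_rms * w for h, w in zip(hidden, weight)]
    return out, hidden


out, _ = rms_norm_reference([3.0, 4.0], [1.0, 1.0])
print(out)
```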