[Misc] Kernel Benchmark for RMSNorm #11241

Merged (5 commits) on Dec 17, 2024

@ywang96 (Member) commented Dec 16, 2024

This PR ports the RMSNorm kernel benchmark authored by @BBuf in sgl-project/sglang#2486 to the vLLM repo so that we can compare the performance of our custom op against the FlashInfer kernel.

Co-authored-by: @BBuf
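
For readers unfamiliar with the operation being benchmarked, here is a minimal reference sketch of RMSNorm in plain Python. It is not the benchmark code or either kernel; the function name `rms_norm`, the default `eps`, and the convention of returning the updated residual alongside the output are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6, residual=None):
    """Illustrative RMSNorm over a single hidden vector (plain lists).

    If `residual` is given, it is added to `x` before normalizing and the
    summed vector is returned as the new residual (an assumed convention).
    """
    if residual is not None:
        x = [a + b for a, b in zip(x, residual)]
    # Mean of squares over the hidden dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Scale each element by the inverse RMS and the learned weight.
    out = [v * inv_rms * w for v, w in zip(x, weight)]
    return (out, x) if residual is not None else out


hidden = [3.0, 4.0]
print(rms_norm(hidden, weight=[1.0, 1.0]))
```

The optimized kernels compute the same quantity per token on the GPU; the benchmark measures how quickly each implementation does so across shapes.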

ywang96 and others added 3 commits December 16, 2024 12:56
Co-authored-by: Xiaoyu Zhang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR.
  • Enable auto-merge.

🚀

@ywang96 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Dec 16, 2024
@mgoin (Member) left a comment

LGTM!

Here are my results from running on an H100:

python benchmark_rmsnorm.py
...
rmsnorm-perf-without-residual:
    head_num  batch_size  seq_len  HuggingFace   FlashInfer         vLLM
0       32.0         1.0     64.0    52.703999     9.792000    11.744000
1       32.0         1.0    128.0    46.208002    11.648000    13.824000
2       32.0         1.0    256.0    52.928001    12.032000    14.272000
3       32.0         1.0    512.0    64.736001    14.208000    18.784000
4       32.0         1.0   1024.0    91.807999    19.872000    27.584000
5       32.0         4.0     64.0    53.056002    13.120000    14.656000
6       32.0         4.0    128.0    65.920003    14.688000    19.200001
7       32.0         4.0    256.0    94.463997    19.904001    27.327999
8       32.0         4.0    512.0   184.064001    31.168001    46.176001
9       32.0         4.0   1024.0   333.792001    60.864002    97.152002
10      32.0        16.0     64.0    92.896000    19.680001    27.200000
11      32.0        16.0    128.0   183.904007    30.975999    45.791999
12      32.0        16.0    256.0   332.704008    60.864002    97.184002
13      32.0        16.0    512.0   618.336022   109.024003   179.296002
14      32.0        16.0   1024.0  1192.352057   205.791995   343.456000
15      32.0        64.0     64.0   333.472013    60.896002    97.280003
16      32.0        64.0    128.0   617.824018   109.024003   179.296002
17      32.0        64.0    256.0  1192.288041   205.888003   343.423992
18      32.0        64.0    512.0  2335.776091   399.295986   671.711981
19      32.0        64.0   1024.0  4625.023842   789.951980  1330.960035
20      48.0         1.0     64.0    48.608001    10.144000    12.704000
21      48.0         1.0    128.0    53.056002    11.744000    14.848000
22      48.0         1.0    256.0    62.463999    13.504000    16.303999
23      48.0         1.0    512.0    80.959998    17.503999    22.720000
24      48.0         1.0   1024.0   142.752007    26.208000    34.623999
25      48.0         4.0     64.0    62.431999    13.504000    16.272001
26      48.0         4.0    128.0    80.159999    17.535999    22.752000
27      48.0         4.0    256.0   143.360004    26.144000    34.527998
28      48.0         4.0    512.0   266.719997    52.703999    72.031997
29      48.0         4.0   1024.0   476.480007    93.631998   133.056000
30      48.0        16.0     64.0   142.848000    26.144000    34.623999
31      48.0        16.0    128.0   266.128004    52.687999    72.095998
32      48.0        16.0    256.0   477.151990    93.567997   133.151993
33      48.0        16.0    512.0   904.640019   173.823997   249.696001
34      48.0        16.0   1024.0  1763.872027   334.975988   484.351993
35      48.0        64.0     64.0   477.344006    93.631998   133.248001
36      48.0        64.0    128.0   903.551996   173.840001   249.791995
37      48.0        64.0    256.0  1764.448047   334.895998   484.320015
38      48.0        64.0    512.0  3475.487947   658.688009   953.232050
39      48.0        64.0   1024.0  6898.752213  1307.775974  1889.744043

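FlashInfer is faster than our custom op in every configuration above, from roughly 1.1x at small shapes to about 1.7x at the largest (head_num=32, batch_size=64, seq_len=1024).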
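
To put a number on the gap, the following sketch computes the vLLM-to-FlashInfer time ratio for a few rows copied from the table above. The helper name `speedup` and the choice of rows are illustrative only; the values themselves are taken verbatim from the results.

```python
# (head_num, batch_size, seq_len, FlashInfer time, vLLM time), copied from the table.
rows = [
    (32, 1, 64, 9.792000, 11.744000),
    (32, 64, 1024, 789.951980, 1330.960035),
    (48, 1, 64, 10.144000, 12.704000),
    (48, 64, 1024, 1307.775974, 1889.744043),
]


def speedup(flashinfer_time, vllm_time):
    """Ratio above 1.0 means FlashInfer took less time than the vLLM op."""
    return vllm_time / flashinfer_time


for head_num, batch_size, seq_len, fi, vl in rows:
    ratio = speedup(fi, vl)
    print(f"heads={head_num} bs={batch_size} seq={seq_len}: {ratio:.2f}x")
```

The ratio grows with the problem size, which suggests the difference matters most for large batches and long sequences.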

Needed to install flashinfer with: uv pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/

seq_len=128,
hidden_size=4096,
use_residual=args.use_residual)

Collaborator

Making these configurable through args would be perfect.

Member Author

That's a good point! @jeejeelee

Added in 71af57f
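
As a rough illustration of what making these values configurable could look like, here is a minimal `argparse` sketch. The flag names, the `build_parser` helper, and the batch-size default are hypothetical and not taken from commit 71af57f; the `seq_len` and `hidden_size` defaults mirror the hard-coded values in the snippet under review.

```python
import argparse


def build_parser():
    """Hypothetical CLI for the benchmark shapes (flag names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Illustrative shape options for an RMSNorm benchmark")
    parser.add_argument("--batch-size", type=int, default=4,
                        help="Batch size (default chosen for illustration)")
    parser.add_argument("--seq-len", type=int, default=128,
                        help="Sequence length")
    parser.add_argument("--hidden-size", type=int, default=4096,
                        help="Hidden dimension being normalized")
    parser.add_argument("--use-residual", action="store_true",
                        help="Benchmark the variant that adds a residual first")
    return parser


# Example: override one shape and enable the residual variant.
args = build_parser().parse_args(["--seq-len", "256", "--use-residual"])
print(args.batch_size, args.seq_len, args.hidden_size, args.use_residual)
```

Keeping the previous hard-coded values as defaults means running the script with no flags behaves as before.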

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
@ywang96 enabled auto-merge (squash) on December 17, 2024 05:57
@WoosukKwon (Collaborator) commented

This is very good to know. The RMSNorm kernel (and the RoPE kernel) are not optimized enough. We should replace them with either the FlashInfer kernel or a Triton kernel.

@ywang96 merged commit 02222a0 into vllm-project:main on Dec 17, 2024
35 checks passed
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
joennlae pushed a commit to 44ai-labs/vllm that referenced this pull request Jan 19, 2025
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>