[Bugfix] support to run partially 2:4 model with CompressedTensors24 scheme #11889

jiangjiadi · 2025-01-09T08:06:01Z

After using llmcompressor to create a partially 2:4 sparse FP8-quantized model (the MLP layers are sparse, the attention layers are not), I benchmarked it and found that its inference speed was no different from that of a regular FP8-quantized model. Further investigation showed that, although the MLP layers are 2:4 sparse, the get_scheme function in CompressedTensorsConfig does not handle partially sparse models correctly, so the MLP layers never use the CompressedTensors24 scheme.
Before:

Fixed:

FIX vllm-project/llm-compressor#1037
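
For readers following the description, the idea of choosing a scheme per layer rather than once per model can be sketched as below. This is a minimal illustration, not the code in this PR: `SparsityConfig`, `matches`, `pick_scheme`, the `re:` prefix handling, and the scheme labels are all hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SparsityConfig:
    """Hypothetical stand-in for a checkpoint's sparsity settings."""
    structure: str = "2:4"
    targets: List[str] = field(default_factory=list)  # layers that are sparse
    ignore: List[str] = field(default_factory=list)   # layers to leave dense


def matches(layer_name: str, patterns: List[str]) -> bool:
    """True if layer_name equals a pattern or matches a 're:'-prefixed regex.

    The 're:' prefix is an assumption of this sketch.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], layer_name):
                return True
        elif pattern == layer_name:
            return True
    return False


def pick_scheme(layer_name: str, sparsity: SparsityConfig) -> str:
    """Return a scheme label for one layer (labels are illustrative only)."""
    is_sparse = (
        sparsity.structure == "2:4"
        and matches(layer_name, sparsity.targets)
        and not matches(layer_name, sparsity.ignore)
    )
    return "sparse_24_fp8" if is_sparse else "dense_fp8"


# Only the MLP projections are marked sparse; attention stays dense.
config = SparsityConfig(targets=[r"re:.*mlp\..*_proj$"])
print(pick_scheme("model.layers.0.mlp.gate_proj", config))
print(pick_scheme("model.layers.0.self_attn.q_proj", config))
```

The point of the sketch is that the sparse/dense decision depends on the individual layer name, so a model with sparse MLP layers and dense attention layers ends up with two different schemes.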

Tested_model:

original qwen2.5-3B
fp8 quantized qwen2.5-3B
fully 2:4 sparse fp8 quantized qwen2.5-3B
partially 2:4 sparse fp8 quantized qwen2.5-3B
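
When comparing checkpoints like the ones listed above, it can help to confirm which layers actually follow the 2:4 pattern, meaning at most two nonzero values in every contiguous group of four along a weight row. The following is a small pure-Python sketch of such a check; the function names are hypothetical and the weights are plain nested lists rather than real tensors.

```python
from typing import Sequence


def row_is_2_4_sparse(row: Sequence[float]) -> bool:
    """Check one weight row: each group of 4 may hold at most 2 nonzeros."""
    if len(row) % 4 != 0:
        # The 2:4 pattern is defined over full groups of four.
        return False
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nonzeros = sum(1 for value in group if value != 0)
        if nonzeros > 2:
            return False
    return True


def weight_is_2_4_sparse(weight: Sequence[Sequence[float]]) -> bool:
    """Check every row of a 2-D weight given as nested sequences."""
    return all(row_is_2_4_sparse(row) for row in weight)


sparse_weight = [
    [0.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.7, 0.0],
    [0.0, 0.0, 0.9, 0.1, 1.5, 0.0, 0.0, 0.0],
]
dense_weight = [
    [0.5, 0.2, 0.1, -1.2, 0.0, 0.3, 0.7, 0.0],
]
print(weight_is_2_4_sparse(sparse_weight))
print(weight_is_2_4_sparse(dense_weight))
```

Running a check like this per layer would show the MLP weights passing and the attention weights failing for a partially sparse model.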

github-actions · 2025-01-09T08:06:13Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mgoin · 2025-01-09T14:18:26Z

cc @dsikka @rahul-tuli

support to run partially 2:4 model with CompressedTensors24 scheme

63c7bbb

jiangjiadi mentioned this pull request Jan 9, 2025

Does llmcompressor support hybrid sparsity? vllm-project/llm-compressor#1037

Open

format fix

9dcd004

jeejeelee requested a review from mgoin January 9, 2025 08:26

remove the duplicated definition of sparsity_scheme

a337d3b
