[Kernel] add triton fused moe kernel for gptq/awq #12185
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@mgoin @robertgshaw2-redhat Could we expedite this PR + #12036 (not sure if #12204 is needed too or has overlap) now that DeepSeek has released their full lineup?
I created a new PR with better
I think this PR could be closed in favor of #12222. Thanks for your work @jinzhen-lin
#12222 is an optimization over #12036 or #12204; it can be combined with this PR to get better performance.
Thank you for the work! We will take a look now
    top_k: tl.constexpr,
    compute_type: tl.constexpr,
    has_zp: tl.constexpr,
    use_int4_w8a16: tl.constexpr,
I think all of these should be renamed to use_int4_w4a16
class MoeQuantIntConfig(QuantizationConfig):
    """Config class for Int8 experts quantization."""
Update this comment
Is there any more specific name we could use for this method? I also feel that --quantization moe_quant_int is not clear. Maybe you could change to --quantization moe_wNa16 and MoeWNA16Config? Open to other names.
moe_wNa16 is a better name; I will change it.
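For illustration, a minimal usage sketch assuming the proposed rename lands (moe_wNa16 is only the name suggested above, not a released option; the checkpoint is the one benchmarked later in this thread):

from vllm import LLM

# Hypothetical: select this quantization path by the proposed name.
llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
          quantization="moe_wNa16")
print(llm.generate("Hello")[0].outputs[0].text)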
def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> Optional["QuantizeMethodBase"]:
    if is_layer_skipped_quant(prefix, self.modules_to_not_convert):
        return UnquantizedLinearMethod()
    elif isinstance(layer, LinearBase):
        if self.linear_quant_method == "gptq":
            gptq_config = GPTQMarlinConfig.from_config(self.full_config)
            return GPTQMarlinLinearMethod(gptq_config)
        elif self.linear_quant_method == "awq":
            awq_config = AWQMarlinConfig.from_config(self.full_config)
            return AWQMarlinLinearMethod(awq_config)
        else:
            raise ValueError("moe_quant_int only support gptq and awq.")
    elif isinstance(layer, FusedMoE):
        return MoeQuantIntMethod(self)
    return None
This is an interesting hack - I wonder if we could just enable the MoeQuantIntMethod as a condition inside of the other quantization methods rather than duplicating them here in this config.
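As a rough, self-contained toy of the alternative being described (stand-in classes only, not vLLM code): the existing config would branch on the layer type itself instead of a new config class wrapping it.

# Toy sketch: the existing gptq-marlin config dispatches to the triton
# fused-MoE method directly when it sees a MoE layer.
class FusedMoELayer: ...        # stand-in for vLLM's FusedMoE
class LinearLayer: ...          # stand-in for LinearBase

class GPTQMarlinConfigSketch:
    def get_quant_method(self, layer):
        if isinstance(layer, FusedMoELayer):
            return "MoeQuantIntMethod"       # triton fused-MoE path from this PR
        if isinstance(layer, LinearLayer):
            return "GPTQMarlinLinearMethod"  # existing marlin linear path
        return None

print(GPTQMarlinConfigSketch().get_quant_method(FusedMoELayer()))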
I considered this before, but I ultimately created a new quantization method. The reasons are:
- This quantization method can be combined with all gptq/awq quantization methods. Should we add it to every quantization method that supports gptq/awq, or just gptq-marlin/awq-marlin?
- This quantization method and the triton kernel use a different weight format; it is merely compatible with gptq/awq and accepts gptq/awq weights.
- It makes the code clearer and easier to maintain (less duplication).
def convert_awq_tensor(tensor, tensor_type):
    # Unpack AWQ int32-packed tensors into int4 nibbles, undo AWQ's
    # interleaved packing order, transpose, then repack two int4 values
    # per uint8.
    size0 = tensor.size(0)
    tensor = tensor.view(torch.uint8)
    shifter = torch.tensor([0, 4],
                           dtype=torch.uint8,
                           device=tensor.device)
    # Split every byte into its low and high nibble.
    tensor = (tensor[:, :, None] >> shifter) & 0xF
    # Undo the AWQ interleave (pack order [0, 2, 4, 6, 1, 3, 5, 7]).
    tensor = tensor.view(-1, 8)[:, [0, 4, 1, 5, 2, 6, 3, 7]].view(size0, -1)
    tensor = tensor.T.contiguous()
    # Repack pairs of int4 values into a single uint8.
    if tensor_type == "qweight":
        tensor = tensor[:, 1::2] * 16 + tensor[:, ::2]
    elif tensor_type == "qzeros":
        tensor = tensor[1::2, :] * 16 + tensor[::2, :]
    return tensor
|
def convert_gptq_int4_qzeros(tensor):
    # Unpack GPTQ int32-packed zero points into int4 nibbles, add back the
    # +1 offset the GPTQ format subtracts when storing qzeros, then repack
    # two int4 values per uint8.
    tensor = tensor.view(torch.uint8)
    shifter = torch.tensor([0, 4],
                           dtype=torch.uint8,
                           device=tensor.device)
    tensor = (tensor[:, :, None] >> shifter) & 0xF
    tensor = tensor + 1
    tensor = tensor[:, :, 0] + tensor[:, :, 1] * 16
    return tensor
Would be nice to have a short description of each transformation
Sure, I will add descriptions later.
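For reference, a runnable toy (not part of the PR) of the nibble unpack/repack these helpers perform:

import torch

# Each uint8 byte holds two int4 values, low nibble first.
packed = torch.tensor([0x21, 0x43], dtype=torch.uint8)   # packs (1, 2) and (3, 4)
shifter = torch.tensor([0, 4], dtype=torch.uint8)
unpacked = (packed[:, None] >> shifter) & 0xF
print(unpacked)    # tensor([[1, 2], [3, 4]], dtype=torch.uint8)

# Repack pairs of nibbles into one uint8, as the qweight branch above does.
repacked = unpacked[:, 1] * 16 + unpacked[:, 0]
print(repacked)    # tensor([33, 67], dtype=torch.uint8) == [0x21, 0x43]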
- use_int8_w8a16 (bool): If True, use matmul of int8 weight and bf16/fp16
    activation to compute the inner products for w1 and w2.
    Defaults to False.
- use_int4_w8a16 (bool): If True, use matmul of int4 weight and bf16/fp16
Suggested change (rename the flag here as well):
- use_int4_w4a16 (bool): If True, use matmul of int4 weight and bf16/fp16
@triton.jit
def fused_moe_kernel_gptq_awq(
There's quite a bit of code duplication between this and fused_moe_kernel - not necessarily a blocker for this PR, but IMO we should refactor and unify these kernels.
At the beginning I tried to modify fused_moe_kernel, but found that this made the original code very complex (with many conditions) and hard to read, so I ended up creating a new function. I'm not sure what the best approach is.
Considering that this allows another option for running quantized MoE models, maybe we should write a documentation page specifically for MoE quantization. I think the best case for this kernel to be used more broadly would be a heuristic on the number of experts, or some configuration option, to decide whether to use the triton or marlin kernel.
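A minimal sketch of the kind of heuristic suggested above (the function name and threshold are made up, not from this PR):

def prefer_triton_moe_kernel(num_experts: int,
                             num_experts_threshold: int = 32) -> bool:
    # Prefer the fused triton kernel when launching one marlin GEMM per
    # expert would dominate; keep the marlin path for small expert counts.
    return num_experts >= num_experts_threshold

print(prefer_triton_moe_kernel(256))  # deepseek-v3-scale expert count -> True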
I tested with a small MoE model (https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4) just now; the triton kernel seems much faster than the marlin kernel there too. Besides, the marlin kernel seems to generate wrong results for this model.

Test result on A100 * 1:

marlin kernel:

triton kernel:
Maybe we should set the triton kernel as the default MoE gptq/awq kernel? But I am not sure how to do this: gptq-marlin-moe is part of the gptq-marlin quantization method, and if I change the MoE kernel of the gptq-marlin method, users cannot use gptq-marlin-moe anymore. Is that ok?
Currently the only option for using MoE + gptq/awq is the Marlin kernel, but a single marlin_gemm_moe would launch at least num_experts CUDA kernels, while the fused_moe triton kernel only needs to launch one CUDA kernel. This makes the Marlin kernel significantly slower than the fused_moe triton kernel. This PR adds support for a fused_moe triton kernel with gptq/awq.
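As a toy illustration (plain PyTorch, not vLLM code) of the launch-pattern difference described above:

import torch

num_experts, tokens_per_expert, hidden, inter = 8, 4, 16, 32
x = torch.randn(num_experts, tokens_per_expert, hidden)
w = torch.randn(num_experts, hidden, inter)

# Marlin-style: one GEMM launch per expert (num_experts launches).
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# Fused-style: all experts handled in a single batched launch.
out_fused = torch.bmm(x, w)

assert torch.allclose(out_loop, out_fused, atol=1e-4)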
Generation speed of deepseek-v3-awq (8*A100-SXM4-80GB, bs=1, short prompt)
Note: moe_align_block_size kernel support for deepseek-v3.