4x16s4 fp32-gemm kernel has better performance than the default (5x16) kernel on Meteor Lake #6480

Open
xujuntwt95329 opened this issue May 27, 2024 · 1 comment

@xujuntwt95329
Contributor

XNNPACK uses the 5x16 fp32-gemm kernel by default for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H):

| Benchmark | 5x16 (us) | 4x16s4 (us) | Reduction in inference time (%) |
| --- | --- | --- | --- |
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
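
For reference, the reduction column can be re-derived from the two timing columns. The snippet below is a minimal sketch, not part of the benchmark code. The helper name `reduction_percent` is made up for illustration, and it assumes the percentage is measured relative to the 5x16 baseline.

```python
def reduction_percent(baseline_us: float, candidate_us: float) -> float:
    """Relative drop in inference time, as a percentage of the baseline."""
    return (baseline_us - candidate_us) / baseline_us * 100.0


# Timings in microseconds, copied from the table: (5x16, 4x16s4).
timings = {
    "FP32MobileNetV1": (16193, 10775),
    "FP32MobileNetV2": (8809, 6626),
    "FP32MobileNetV3Large": (7756, 6052),
    "FP32MobileNetV3Small": (2180, 1970),
}

for name, (baseline, candidate) in timings.items():
    print(f"{name}: {reduction_percent(baseline, candidate):.2f}%")
```

Under that assumption the computed values match the reported column (33.46, 24.78, 21.97, 9.63).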

Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd

We can submit a PR if this is welcome.

@fbarchard
Contributor

Note that this is due to register spilling with Visual C. Clang produces better code for the 5x16 kernel.
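
A rough way to see why the 5x16 tile is sensitive to register allocation is to count vector registers. The sketch below is a back-of-the-envelope model, not taken from XNNPACK or from this thread. It assumes 16 YMM registers (AVX2/FMA3, no AVX-512), 8 floats per register, and a compiler that keeps every accumulator, every A value, and every B vector of one inner-loop step live at the same time. The function name `live_vectors` is hypothetical.

```python
YMM_REGISTERS = 16   # architectural YMM registers on x86-64 without AVX-512
FLOATS_PER_YMM = 8   # 256-bit register holding 32-bit floats


def live_vectors(mr: int, nr: int) -> int:
    """Worst-case live vector count for an MR x NR tile (assumed model)."""
    accumulators = mr * (nr // FLOATS_PER_YMM)  # one per row per 8 columns
    b_vectors = nr // FLOATS_PER_YMM            # one packed-weight load per 8 columns
    a_vectors = mr                              # one A value per row, all held live
    return accumulators + b_vectors + a_vectors


for mr, nr in [(5, 16), (4, 16)]:
    need = live_vectors(mr, nr)
    verdict = "fits" if need <= YMM_REGISTERS else "may spill"
    print(f"{mr}x{nr}: {need} vectors, {verdict}")
```

In this model the 5x16 tile needs 17 vectors and the 4x16 tile needs 14. A compiler that interleaves the A loads with the multiply-adds can get the 5x16 tile down to 13 live vectors (10 accumulators, 2 B vectors, 1 A value), which is consistent with the observation that the outcome depends on the compiler.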
