Support for saturating doubling multiply add #2050
Comments
It sounds reasonable to want access to this instruction. I'm curious about the purpose of the extra 2x mul in their definition. Because NEON differs from SVE in that it takes the upper/lower half rather than even/odd lanes, I think we'd want to define the op like
The use case we've tested is a vector multiplication, so the order needs to be preserved. You raise a good point about the difference between NEON and SVE. We could potentially return two vectors, the lower and upper half of the result. This sounds like a good compromise, since I'd imagine you'd need the whole vector for most use cases anyway.
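To illustrate the ordering concern above, here is a minimal scalar sketch of the two lane-selection conventions for a widening op: lower/upper half (NEON-style) versus even/odd lanes (SVE-style). The names `Narrow`, `Wide`, `WidenLowerHalf`, `WidenUpperHalf`, `WidenEven` and `WidenOdd` are hypothetical and are not part of Highway, NEON or SVE. The fixed lane count of 8 is also an assumption made for illustration, and the sketch only widens the lanes without multiplying.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption for illustration: a fixed-size vector of 8 int16_t lanes.
constexpr std::size_t kLanes = 8;
using Narrow = std::array<int16_t, kLanes>;
using Wide = std::array<int32_t, kLanes / 2>;

// Lower/upper-half convention: input lanes [0, N/2) and [N/2, N).
inline Wide WidenLowerHalf(const Narrow& v) {
  Wide out{};
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = v[i];
  return out;
}

inline Wide WidenUpperHalf(const Narrow& v) {
  Wide out{};
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = v[i + kLanes / 2];
  return out;
}

// Even/odd convention: input lanes 0, 2, 4, ... and 1, 3, 5, ...
inline Wide WidenEven(const Narrow& v) {
  Wide out{};
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = v[2 * i];
  return out;
}

inline Wide WidenOdd(const Narrow& v) {
  Wide out{};
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = v[2 * i + 1];
  return out;
}
```

With the lower/upper convention, concatenating the two results reproduces the original lane order. With even/odd, the results would need to be interleaved to restore it.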
Makes sense. I'm still curious why the instruction includes the 2x factor. Is it also for Q1.x fixed-point (similar to MulFixedPoint15)?
Yes, that is correct. I'm still trying to find out why there is a 2x factor myself. I'll give an update once I have an answer.
So it looks like the doubling is just a side effect of the way the multiplication is done. The operands are in Q0.15 format and the product is saturated to Q0.31 format, which is equivalent to a doubling saturating multiply-add of two 16-bit integers.
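As a rough scalar sketch of that explanation: the integer product of two Q0.15 values has 30 fractional bits, so doubling it yields a value with 31 fractional bits, i.e. Q0.31. The helper names `SaturateToI32` and `SatDoublingMulAddScalar` are hypothetical. The sketch assumes that both the doubled product and the final sum saturate to the `int32_t` range, and it is not taken from Highway or from any intrinsic's reference implementation.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: clamps a 64-bit value to the int32_t range.
inline int32_t SaturateToI32(int64_t v) {
  constexpr int64_t kMin = std::numeric_limits<int32_t>::min();
  constexpr int64_t kMax = std::numeric_limits<int32_t>::max();
  return static_cast<int32_t>(v < kMin ? kMin : (v > kMax ? kMax : v));
}

// Scalar model (assumed semantics): acc + 2 * a * b, saturating at each step.
// With a and b interpreted as Q0.15, the doubled product is in Q0.31.
inline int32_t SatDoublingMulAddScalar(int32_t acc, int16_t a, int16_t b) {
  // Compute in 64 bits so that the doubling cannot overflow.
  const int64_t product = 2 * static_cast<int64_t>(a) * static_cast<int64_t>(b);
  // Only a == b == -32768 exceeds the int32_t range here (2^31).
  const int32_t doubled = SaturateToI32(product);
  return SaturateToI32(static_cast<int64_t>(acc) + doubled);
}
```

For example, 0.5 in Q0.15 is 16384, and `SatDoublingMulAddScalar(0, 16384, 16384)` yields 536870912, which is 0.25 in Q0.31. The single product that saturates is -1.0 times -1.0, where both inputs are -32768.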
Got it, thanks. In that case adding FixedPoint to the op name may be helpful.
AVX3_DL is also capable of carrying out the saturated doubling multiply add using the _mm*_dpwssds_epi32 intrinsics. Here is how the vqdmla op can be implemented on AVX3_DL for I32 vectors that have 1 to 4 lanes:
The vqdmla op can be carried out on PPC8/PPC9/PPC10 using vec_msums. Here is how the vqdmla op can be implemented on PPC8/PPC9/PPC10:
Very nice!
@jan-wassenberg Yes, that is perfect! I've added a comment on the PR requesting support for 8-bit and 16-bit elements as well.
There are saturating doubling multiply-add instructions in NEON and SVE, vqdmla and svqdmla, which are equivalent to `hn::SaturatedAdd(hn::Mul(2, hn::Mul(a_real, b_imag)), hn::Mul(2, hn::Mul(a_imag, b_real)));`. In testing, we have seen a significant performance decrease using Highway vs NEON and Highway vs SVE. I was wondering if it would be possible to add a similar instruction to Highway so we are able to leverage these qdmla instructions.