Use Avx10.2 Instructions in Floating Point Conversions #111775

khushal1996 · 2025-01-24T03:51:36Z

Overview

This PR optimizes x64 floating-point to integer conversions using the new saturating conversion instructions introduced in AVX10.2. We follow the spec doc to add the new instructions and optimize the x64/x86 conversions.

Testing

All of the changes made for testing are present in this branch.

Step 1: Run superpmi.exe on the library mch files with JITLateDisasm to check whether any errors occur. JITLateDisasm is used to confirm that the emitted byte stream decodes validly through the LLVM disassembler.

For this step, a new coredistools built from the LLVM repo was used. After running superpmi with JITLateDisasm, no decoding failures were detected. Please reach out for access to the superpmi logs.

Step 2: Run superpmi and check for asmdiffs and assert errors.

Below is the summary of the superpmi run between this PR and PR #111209:

[21:29:59] Summary of Code Size diffs:
[21:29:59] (Lower is better)
[21:29:59] 
[21:29:59] Total bytes of base: 3703239 (overridden on cmd)
[21:29:59] Total bytes of diff: 3702902 (overridden on cmd)
[21:29:59] Total bytes of delta: -337 (-0.01 % of base)
[21:29:59]     diff is an improvement.
[21:29:59]     relative diff is an improvement.
[21:29:59] 
[21:29:59] 
[21:29:59] Top file improvements (bytes):
[21:29:59]          -92 : 10769.dasm (-33.58% of base)
[21:29:59]          -82 : 10999.dasm (-20.60% of base)
[21:29:59]          -41 : 11097.dasm (-13.58% of base)
[21:29:59]          -41 : 11022.dasm (-1.66% of base)
[21:29:59]          -41 : 9956.dasm (-35.96% of base)
[21:29:59]          -40 : 9295.dasm (-8.05% of base)
[21:29:59] 
[21:29:59] 6 total files with Code Size differences (6 improved, 0 regressed), 0 unchanged.
[21:29:59] 
[21:29:59] Top method improvements (bytes):
[21:29:59]          -92 (-33.58% of base) : 10769.dasm - System.Convert:ToInt32(double):int (FullOpts)
[21:29:59]          -82 (-20.60% of base) : 10999.dasm - System.Collections.Hashtable:.ctor(int,float):this (FullOpts)
[21:29:59]          -41 (-35.96% of base) : 9956.dasm - System.Collections.HashHelpers:IsPrime(int):ubyte (FullOpts)
[21:29:59]          -41 (-1.66% of base) : 11022.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (FullOpts)
[21:29:59]          -41 (-13.58% of base) : 11097.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (FullOpts)
[21:29:59]          -40 (-8.05% of base) : 9295.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)
[21:29:59] 
[21:29:59] Top method improvements (percentages):
[21:29:59]          -41 (-35.96% of base) : 9956.dasm - System.Collections.HashHelpers:IsPrime(int):ubyte (FullOpts)
[21:29:59]          -92 (-33.58% of base) : 10769.dasm - System.Convert:ToInt32(double):int (FullOpts)
[21:29:59]          -82 (-20.60% of base) : 10999.dasm - System.Collections.Hashtable:.ctor(int,float):this (FullOpts)
[21:29:59]          -41 (-13.58% of base) : 11097.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (FullOpts)
[21:29:59]          -40 (-8.05% of base) : 9295.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)
[21:29:59]          -41 (-1.66% of base) : 11022.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (FullOpts)
[21:29:59] 
[21:29:59] 6 total methods with Code Size differences (6 improved, 0 regressed).
[21:29:59] 
[21:29:59] --------------------------------------------------------------------------------
[21:29:59] 6 contexts with diffs (6 size improvements, 0 size regressions, 0 same size)
[21:29:59]                       (6 PerfScore improvements, 0 PerfScore regressions, 0 same PerfScore)
[21:29:59]   -337 bytes
[21:29:59]   -11.49% PerfScore
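
As a quick sanity check on the summary above, the six per-method deltas should add up to the reported total, and the total should match the difference between the base and diff byte counts. The sketch below only re-does that arithmetic with numbers copied from the log; the variable names are illustrative and not part of superpmi.

```python
# Per-method byte deltas copied from the "Top method improvements (bytes)" list.
deltas = {
    "System.Convert:ToInt32(double):int": -92,
    "System.Collections.Hashtable:.ctor(int,float):this": -82,
    "System.Collections.HashHelpers:IsPrime(int):ubyte": -41,
    "System.Number:Dragon4(...)": -41,
    "System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(...)": -41,
    "System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte": -40,
}

# Totals copied from the "Summary of Code Size diffs" header.
base_bytes = 3703239
diff_bytes = 3702902

total_delta = sum(deltas.values())
pct_of_base = 100.0 * total_delta / base_bytes

# The per-method deltas must account for the whole code size change.
assert total_delta == diff_bytes - base_bytes

print(f"{total_delta} bytes ({pct_of_base:.2f}% of base)")
```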

The diffs make sense: all of the diffs in the superpmi logs belong to the conversion scenario. For example:

@@ -32,18 +30,12 @@ G_M1064_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=10 bbWeight=1 PerfScore 3.33
 G_M1064_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
        vucomisd xmm0, qword ptr [reloc @RWD00]
-       jb       G_M1064_IG09
-       vmovaps  xmm1, xmm0
-       vfixupimmsd xmm1, xmm0, qword ptr [reloc @RWD16], 0
-       vcmppd   k1, xmm1, xmmword ptr [reloc @RWD32], 13
-       vcvttsd2si eax, xmm1
-       vpbroadcastd xmm1, eax
-       vpblendmd xmm1 {k1}, xmm1, dword ptr [reloc @RWD48] {1to4}
-       vmovd    eax, xmm1
+       jb       SHORT G_M1064_IG09
+       vcvttsd2sis eax, xmm0
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, eax
        vsubsd   xmm0, xmm0, xmm1
-       vmovsd   xmm1, qword ptr [reloc @RWD56]
+       vmovsd   xmm1, qword ptr [reloc @RWD08]
        vucomisd xmm1, xmm0
        ja       SHORT G_M1064_IG04
        vucomisd xmm0, xmm1
@@ -51,7 +43,7 @@ G_M1064_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        jne      SHORT G_M1064_IG05
        test     al, 1
        je       SHORT G_M1064_IG05
-						;; size=102 bbWeight=0.50 PerfScore 22.42
+						;; size=54 bbWeight=0.50 PerfScore 15.79
 G_M1064_IG04:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        dec      eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
@@ -61,20 +53,14 @@ G_M1064_IG05:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        ret      
 						;; size=6 bbWeight=0.50 PerfScore 0.88
 G_M1064_IG06:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, isz
-       vmovsd   xmm1, qword ptr [reloc @RWD64]
+       vmovsd   xmm1, qword ptr [reloc @RWD16]
        vucomisd xmm1, xmm0
        jbe      SHORT G_M1064_IG09
-       vmovaps  xmm1, xmm0
-       vfixupimmsd xmm1, xmm0, qword ptr [reloc @RWD16], 0
-       vcmppd   k1, xmm1, xmmword ptr [reloc @RWD32], 13
-       vcvttsd2si eax, xmm1
-       vpbroadcastd xmm1, eax
-       vpblendmd xmm1 {k1}, xmm1, dword ptr [reloc @RWD48] {1to4}
-       vmovd    eax, xmm1
+       vcvttsd2sis eax, xmm0
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, eax
        vsubsd   xmm0, xmm0, xmm1
-       vmovsd   xmm1, qword ptr [reloc @RWD72]
+       vmovsd   xmm1, qword ptr [reloc @RWD24]
        vucomisd xmm0, xmm1
        ja       SHORT G_M1064_IG07
        vucomisd xmm0, xmm1
@@ -82,7 +68,7 @@ G_M1064_IG06:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000
        jne      SHORT G_M1064_IG08
        test     al, 1
        je       SHORT G_M1064_IG08
-						;; size=102 bbWeight=0.50 PerfScore 22.92
+						;; size=58 bbWeight=0.50 PerfScore 16.29
 G_M1064_IG07:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        inc      eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
@@ -113,17 +99,12 @@ G_M1064_IG09:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}

Since these diffs are expected, we can conclude that the superpmi run is successful.
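
For readers less familiar with the diff above: the removed sequence (vfixupimmsd, vcmppd, vcvttsd2si, vpbroadcastd, vpblendmd, vmovd) appears to emulate a saturating double to int32 conversion, which vcvttsd2sis now performs in a single instruction. Below is a minimal reference model of that behavior, assuming the semantics are "NaN becomes 0, out-of-range values clamp to the int32 limits, everything else truncates toward zero". The function name `to_int32_saturating` is hypothetical and is not part of the PR or the JIT.

```python
import math

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def to_int32_saturating(x: float) -> int:
    """Reference model (assumed semantics) of a saturating double -> int32 conversion."""
    if math.isnan(x):
        # Assumption: NaN converts to 0.
        return 0
    if x >= 2.0**31:
        # Too large for int32 (including +inf): clamp to the maximum.
        return INT32_MAX
    if x <= -(2.0**31):
        # Too small for int32 (including -inf): clamp to the minimum.
        return INT32_MIN
    # In range: truncate toward zero.
    return math.trunc(x)


if __name__ == "__main__":
    for value in (3.9, -3.9, 1e10, -1e10, float("inf"), float("nan")):
        print(value, "->", to_int32_saturating(value))
```

A model like this can serve as an oracle when comparing conversion results between the Avx512 and Avx10.2 code paths, since both are expected to produce the same values.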

Step 3: Run the JIT test suite using a stable subset of tests on SDE.

Results

Optimized ASM

Note: Below is a case-by-case comparison of the asm generated for Avx512 vs Avx10.2. The Avx10.2 asm was collected on SDE.