Use Avx10.2 Instructions in Floating Point Conversions #111775

Open: wants to merge 40 commits into main

khushal1996 (Contributor) commented Jan 24, 2025

Overview

This PR optimizes x64 floating-point-to-integer conversions by using the new saturating conversion instructions introduced in AVX10.2. The new instructions are added following the AVX10.2 specification, and the x86/x64 conversion codegen is updated to use them.

Testing

All of the changes made for testing are present in this branch.


Step 1: Run superpmi.exe on the library .mch files with JITLateDisasm enabled to check for errors. JITLateDisasm checks that the emitted byte stream decodes correctly through the LLVM disassembler.

For this step, a new coredistools built from the LLVM repo was used. Running superpmi with JITLateDisasm detected no decoding failures. Please contact me for access to the superpmi logs.


Step 2: Run superpmi and check for asmdiffs and assertion failures.

Below is the summary of the superpmi run comparing this PR against PR #111209:

[21:29:59] Summary of Code Size diffs:
[21:29:59] (Lower is better)
[21:29:59] 
[21:29:59] Total bytes of base: 3703239 (overridden on cmd)
[21:29:59] Total bytes of diff: 3702902 (overridden on cmd)
[21:29:59] Total bytes of delta: -337 (-0.01 % of base)
[21:29:59]     diff is an improvement.
[21:29:59]     relative diff is an improvement.
[21:29:59] 
[21:29:59] 
[21:29:59] Top file improvements (bytes):
[21:29:59]          -92 : 10769.dasm (-33.58% of base)
[21:29:59]          -82 : 10999.dasm (-20.60% of base)
[21:29:59]          -41 : 11097.dasm (-13.58% of base)
[21:29:59]          -41 : 11022.dasm (-1.66% of base)
[21:29:59]          -41 : 9956.dasm (-35.96% of base)
[21:29:59]          -40 : 9295.dasm (-8.05% of base)
[21:29:59] 
[21:29:59] 6 total files with Code Size differences (6 improved, 0 regressed), 0 unchanged.
[21:29:59] 
[21:29:59] Top method improvements (bytes):
[21:29:59]          -92 (-33.58% of base) : 10769.dasm - System.Convert:ToInt32(double):int (FullOpts)
[21:29:59]          -82 (-20.60% of base) : 10999.dasm - System.Collections.Hashtable:.ctor(int,float):this (FullOpts)
[21:29:59]          -41 (-35.96% of base) : 9956.dasm - System.Collections.HashHelpers:IsPrime(int):ubyte (FullOpts)
[21:29:59]          -41 (-1.66% of base) : 11022.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (FullOpts)
[21:29:59]          -41 (-13.58% of base) : 11097.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (FullOpts)
[21:29:59]          -40 (-8.05% of base) : 9295.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)
[21:29:59] 
[21:29:59] Top method improvements (percentages):
[21:29:59]          -41 (-35.96% of base) : 9956.dasm - System.Collections.HashHelpers:IsPrime(int):ubyte (FullOpts)
[21:29:59]          -92 (-33.58% of base) : 10769.dasm - System.Convert:ToInt32(double):int (FullOpts)
[21:29:59]          -82 (-20.60% of base) : 10999.dasm - System.Collections.Hashtable:.ctor(int,float):this (FullOpts)
[21:29:59]          -41 (-13.58% of base) : 11097.dasm - System.Number+Grisu3:GetCachedPowerForBinaryExponentRange(int,int,byref):System.Number+DiyFp (FullOpts)
[21:29:59]          -40 (-8.05% of base) : 9295.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)
[21:29:59]          -41 (-1.66% of base) : 11022.dasm - System.Number:Dragon4(ulong,int,uint,ubyte,int,ubyte,System.Span`1[ubyte],byref):uint (FullOpts)
[21:29:59] 
[21:29:59] 6 total methods with Code Size differences (6 improved, 0 regressed).
[21:29:59] 
[21:29:59] --------------------------------------------------------------------------------
[21:29:59] 6 contexts with diffs (6 size improvements, 0 size regressions, 0 same size)
[21:29:59]                       (6 PerfScore improvements, 0 PerfScore regressions, 0 same PerfScore)
[21:29:59]   -337 bytes
[21:29:59]   -11.49% PerfScore
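The per-method deltas listed above add up to the reported total. The short C++ check below only re-derives the totals from numbers copied out of the log; it performs no measurement of its own, and the constant and function names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <numeric>

// Byte deltas copied from "Top method improvements (bytes)" in the log.
constexpr std::array<int, 6> kMethodDeltas = {-92, -82, -41, -41, -41, -40};

// Totals copied from "Summary of Code Size diffs" in the log.
constexpr long long kBaseBytes = 3703239;
constexpr long long kDiffBytes = 3702902;

// Sum of the six per-method deltas.
int total_method_delta() {
    return std::accumulate(kMethodDeltas.begin(), kMethodDeltas.end(), 0);
}

// Overall delta expressed as a percentage of the base size.
double delta_percent_of_base() {
    return 100.0 * static_cast<double>(kDiffBytes - kBaseBytes) /
           static_cast<double>(kBaseBytes);
}
```

Both the summed deltas and `kDiffBytes - kBaseBytes` come to -337 bytes, and -337 / 3703239 is roughly -0.009%, which the log rounds to -0.01%.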

The diffs are as expected: all of the diffs in the superpmi logs come from conversion scenarios. For example:

@@ -32,18 +30,12 @@ G_M1064_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=10 bbWeight=1 PerfScore 3.33
 G_M1064_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
        vucomisd xmm0, qword ptr [reloc @RWD00]
-       jb       G_M1064_IG09
-       vmovaps  xmm1, xmm0
-       vfixupimmsd xmm1, xmm0, qword ptr [reloc @RWD16], 0
-       vcmppd   k1, xmm1, xmmword ptr [reloc @RWD32], 13
-       vcvttsd2si eax, xmm1
-       vpbroadcastd xmm1, eax
-       vpblendmd xmm1 {k1}, xmm1, dword ptr [reloc @RWD48] {1to4}
-       vmovd    eax, xmm1
+       jb       SHORT G_M1064_IG09
+       vcvttsd2sis eax, xmm0
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, eax
        vsubsd   xmm0, xmm0, xmm1
-       vmovsd   xmm1, qword ptr [reloc @RWD56]
+       vmovsd   xmm1, qword ptr [reloc @RWD08]
        vucomisd xmm1, xmm0
        ja       SHORT G_M1064_IG04
        vucomisd xmm0, xmm1
@@ -51,7 +43,7 @@ G_M1064_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        jne      SHORT G_M1064_IG05
        test     al, 1
        je       SHORT G_M1064_IG05
-						;; size=102 bbWeight=0.50 PerfScore 22.42
+						;; size=54 bbWeight=0.50 PerfScore 15.79
 G_M1064_IG04:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        dec      eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
@@ -61,20 +53,14 @@ G_M1064_IG05:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        ret      
 						;; size=6 bbWeight=0.50 PerfScore 0.88
 G_M1064_IG06:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, isz
-       vmovsd   xmm1, qword ptr [reloc @RWD64]
+       vmovsd   xmm1, qword ptr [reloc @RWD16]
        vucomisd xmm1, xmm0
        jbe      SHORT G_M1064_IG09
-       vmovaps  xmm1, xmm0
-       vfixupimmsd xmm1, xmm0, qword ptr [reloc @RWD16], 0
-       vcmppd   k1, xmm1, xmmword ptr [reloc @RWD32], 13
-       vcvttsd2si eax, xmm1
-       vpbroadcastd xmm1, eax
-       vpblendmd xmm1 {k1}, xmm1, dword ptr [reloc @RWD48] {1to4}
-       vmovd    eax, xmm1
+       vcvttsd2sis eax, xmm0
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, eax
        vsubsd   xmm0, xmm0, xmm1
-       vmovsd   xmm1, qword ptr [reloc @RWD72]
+       vmovsd   xmm1, qword ptr [reloc @RWD24]
        vucomisd xmm0, xmm1
        ja       SHORT G_M1064_IG07
        vucomisd xmm0, xmm1
@@ -82,7 +68,7 @@ G_M1064_IG06:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000
        jne      SHORT G_M1064_IG08
        test     al, 1
        je       SHORT G_M1064_IG08
-						;; size=102 bbWeight=0.50 PerfScore 22.92
+						;; size=58 bbWeight=0.50 PerfScore 16.29
 G_M1064_IG07:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        inc      eax
 						;; size=2 bbWeight=0.50 PerfScore 0.12
@@ -113,17 +99,12 @@ G_M1064_IG09:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}

Since these diffs are expected, we conclude that the superpmi run is successful.
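To see why a single `vcvttsd2sis` can replace the seven removed instructions, the C++ sketch below models both paths for double to int32. It rests on assumptions the diff does not state: that the `vfixupimmsd` table at `RWD16` maps NaN to +0.0, and that the `vcmppd` (predicate 13, greater-than-or-equal) compares against 2^31 at `RWD32`. It also relies on the documented x86 behavior that `cvttsd2si` returns the integer indefinite value 0x80000000 (INT32_MIN) for NaN or out-of-range inputs. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

constexpr int32_t kIntMax = std::numeric_limits<int32_t>::max();
constexpr int32_t kIntMin = std::numeric_limits<int32_t>::min();

// Model of vcvttsd2si: invalid results become the integer indefinite value (INT32_MIN).
int32_t model_cvttsd2si(double v) {
    if (std::isnan(v) || v >= 2147483648.0 || v <= -2147483649.0) {
        return kIntMin;
    }
    return static_cast<int32_t>(v);  // in range: truncate toward zero
}

// Model of the removed AVX512 sequence (table and constant are assumptions, see above).
int32_t model_avx512_sequence(double v) {
    double fixed = std::isnan(v) ? 0.0 : v;      // vfixupimmsd: NaN -> +0.0
    bool too_large = fixed >= 2147483648.0;      // vcmppd k1, ..., 13 against 2^31
    int32_t converted = model_cvttsd2si(fixed);  // vcvttsd2si eax, xmm1
    return too_large ? kIntMax : converted;      // vpblendmd selects INT32_MAX where k1 is set
}

// Model of vcvttsd2sis: one saturating conversion.
int32_t model_vcvttsd2sis(double v) {
    if (std::isnan(v)) return 0;
    if (v >= 2147483648.0) return kIntMax;
    if (v < -2147483648.0) return kIntMin;
    return static_cast<int32_t>(v);
}
```

Negative overflow needs no patching on the AVX512 path because the indefinite value 0x80000000 already equals INT32_MIN; only NaN and positive overflow must be fixed up, which is what the fixup and the masked blend handle.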


Step 3: Run a stable subset of the JIT test suite on SDE (Intel Software Development Emulator).

[Image: Results]

Optimized ASM


Note: Below is a case-by-case comparison of the asm generated for AVX512 vs. AVX10.2. The AVX10.2 asm was collected on SDE.

Case: Float to Int packed

**Test code**

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static Vector128<int> FloatToInt(Vector128<float> val)
    {
        return Vector128.ConvertToInt32(val);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(FloatToInt(Vector128.Create((float)4.5)));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]
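For reference, here is a lane-by-lane C++ sketch of the result `Vector128.ConvertToInt32` is expected to produce, assuming saturating semantics (truncate toward zero, NaN to 0, out-of-range values clamped). `std::array<float, 4>` stands in for `Vector128<float>`; the aliases and helper names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

using Float4 = std::array<float, 4>;  // stand-in for Vector128<float>
using Int4 = std::array<int32_t, 4>;  // stand-in for Vector128<int>

// Saturating float -> int32 for one lane.
int32_t saturate_lane_to_int32(float v) {
    if (std::isnan(v)) return 0;
    if (v >= 2147483648.0f) return std::numeric_limits<int32_t>::max();  // >= 2^31
    if (v < -2147483648.0f) return std::numeric_limits<int32_t>::min();  // < -2^31
    return static_cast<int32_t>(v);  // in range: truncate toward zero
}

// Apply the scalar rule independently to each of the four lanes.
Int4 convert_to_int32(const Float4& lanes) {
    Int4 result{};
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        result[i] = saturate_lane_to_int32(lanes[i]);
    }
    return result;
}
```

For the test input `Vector128.Create((float)4.5)`, every lane truncates to 4.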

Case: Float to UInt packed

**Test code**

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static Vector128<uint> FloatToUInt(Vector128<float> val)
    {
        return Vector128.ConvertToUInt32(val);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(FloatToUInt(Vector128.Create((float)4.5)));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]
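The unsigned case differs mainly at the low end: under the assumed saturating semantics, NaN and every negative lane become 0, and lanes at or above 2^32 clamp to `UINT32_MAX`. A hypothetical C++ sketch, again with `std::array<float, 4>` standing in for `Vector128<float>`:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

using Float4 = std::array<float, 4>;    // stand-in for Vector128<float>
using UInt4 = std::array<uint32_t, 4>;  // stand-in for Vector128<uint>

// Saturating float -> uint32 for one lane.
uint32_t saturate_lane_to_uint32(float v) {
    // NaN compares false, so NaN, zero, and all negatives return 0 here
    // (values in (-1, 0) would truncate to 0 anyway).
    if (!(v > 0.0f)) return 0;
    if (v >= 4294967296.0f) return std::numeric_limits<uint32_t>::max();  // >= 2^32
    return static_cast<uint32_t>(v);  // in range: truncate toward zero
}

// Apply the scalar rule independently to each of the four lanes.
UInt4 convert_to_uint32(const Float4& lanes) {
    UInt4 result{};
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        result[i] = saturate_lane_to_uint32(lanes[i]);
    }
    return result;
}
```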

Case: Double to Long packed

**Test code**

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static Vector128<long> DoubleToLong(Vector128<double> val)
    {
        return Vector128.ConvertToInt64(val);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(DoubleToLong(Vector128.Create((double)4.5)));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]
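For the two-lane double-to-long case, the only change from the 32-bit sketches is the bound, 2^63, which is exactly representable as a double. A hypothetical C++ sketch under the same assumed saturating semantics, with `std::array<double, 2>` standing in for `Vector128<double>`:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

using Double2 = std::array<double, 2>;  // stand-in for Vector128<double>
using Long2 = std::array<int64_t, 2>;   // stand-in for Vector128<long>

// Saturating double -> int64 for one lane.
int64_t saturate_lane_to_int64(double v) {
    if (std::isnan(v)) return 0;
    if (v >= 9223372036854775808.0) return std::numeric_limits<int64_t>::max();  // >= 2^63
    if (v < -9223372036854775808.0) return std::numeric_limits<int64_t>::min();  // < -2^63
    return static_cast<int64_t>(v);  // in range: truncate toward zero
}

// Apply the scalar rule independently to both lanes.
Long2 convert_to_int64(const Double2& lanes) {
    Long2 result{};
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        result[i] = saturate_lane_to_int64(lanes[i]);
    }
    return result;
}
```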

Case: Double to ULong packed

**Test code**

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static Vector128<ulong> DoubleToULong(Vector128<double> val)
    {
        return Vector128.ConvertToUInt64(val);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(DoubleToULong(Vector128.Create((double)4.5)));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]
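The double-to-ulong case combines the unsigned low-end rule (NaN and negatives to 0) with a 2^64 upper bound. A hypothetical C++ sketch under the same assumed saturating semantics:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

using Double2 = std::array<double, 2>;   // stand-in for Vector128<double>
using ULong2 = std::array<uint64_t, 2>;  // stand-in for Vector128<ulong>

// Saturating double -> uint64 for one lane.
uint64_t saturate_lane_to_uint64(double v) {
    // NaN compares false, so NaN, zero, and all negatives return 0.
    if (!(v > 0.0)) return 0;
    if (v >= 18446744073709551616.0) return std::numeric_limits<uint64_t>::max();  // >= 2^64
    return static_cast<uint64_t>(v);  // in range: truncate toward zero
}

// Apply the scalar rule independently to both lanes.
ULong2 convert_to_uint64(const Double2& lanes) {
    ULong2 result{};
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        result[i] = saturate_lane_to_uint64(lanes[i]);
    }
    return result;
}
```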

Case: Float to Int Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static int FloatToInt(float val)
    {
        return (int)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(FloatToInt(-4.5f));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Float to UInt Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static uint FloatToUInt(float val)
    {
        return (uint)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(FloatToUInt(-4.5f));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Float to Long Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static long FloatToLong(float val)
    {
        return (long)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(FloatToLong(-4.5f));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Float to ULong Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static ulong FloatToULong(float val)
    {
        return (ulong)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(FloatToULong(-4.5f));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Double to Long Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static long DoubleToLong(double val)
    {
        return (long)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(DoubleToLong(-4.5));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Double to ULong Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static ulong DoubleToULong(double val)
    {
        return (ulong)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(DoubleToULong(-4.5));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Double to Int Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static int DoubleToInt(double val)
    {
        return (int)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(DoubleToInt(-4.5));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]

Case: Double to UInt Scalar

**Test code**

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImplAttribute(MethodImplOptions.NoInlining)]
    public static uint DoubleToUInt(double val)
    {
        return (uint)val;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(DoubleToUInt(-4.5));
    }
}

[Image: AVX512 asm (left) vs. AVX10.2 asm (right)]
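Every scalar test above passes -4.5 (as a float or a double). Assuming saturating semantics (truncate toward zero, NaN to 0, out-of-range values clamped), the signed conversions should print -4 and the unsigned ones 0. The generic C++ sketch below models all eight scalar cases in one function; the template `saturating_cast` is hypothetical and is not a JIT or BCL API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: saturating float/double -> integer conversion.
template <typename Int, typename Float>
Int saturating_cast(Float v) {
    static_assert(std::is_integral_v<Int> && std::is_floating_point_v<Float>,
                  "saturating_cast converts a floating-point value to an integer type");
    if (std::isnan(v)) {
        return 0;  // NaN maps to zero
    }
    // 2^digits is one past the largest value for both signed and unsigned
    // targets (digits excludes the sign bit for signed types).
    const Float upper = std::ldexp(Float(1), std::numeric_limits<Int>::digits);
    if (v >= upper) {
        return std::numeric_limits<Int>::max();
    }
    if constexpr (std::is_signed_v<Int>) {
        if (v < -upper) {
            return std::numeric_limits<Int>::min();  // below -2^digits
        }
    } else {
        if (v <= Float(0)) {
            return 0;  // negatives (and zero) clamp to 0 for unsigned targets
        }
    }
    return static_cast<Int>(v);  // in range: truncate toward zero
}
```

One template covers both float and double sources because only the 2^digits bound depends on the target type, and that bound is exactly representable in either format.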

khushal1996 and others added 30 commits January 8, 2025 11:58
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Jan 24, 2025

Note regarding the new-api-needs-documentation label:

This serves as a reminder: when your PR modifies a ref *.cs file and adds or modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments so the PR reviewers can sign off on that change.


@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jan 24, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@khushal1996 khushal1996 marked this pull request as ready for review January 24, 2025 22:08
@@ -372,6 +373,8 @@ GenTree* Compiler::fgMorphExpandCast(GenTreeCast* tree)
#else
#if defined(TARGET_AMD64)
// Following nodes are handled when lowering the nodes
// float -> ulong/uint/int/long fro AVX10.2
Contributor


Suggested change
// float -> ulong/uint/int/long fro AVX10.2
// float -> ulong/uint/int/long for AVX10.2
