Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avx512 extract most significant bits #82731

Merged
107 changes: 101 additions & 6 deletions src/coreclr/jit/emitxarch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,15 @@ bool emitter::IsSSEOrAVXInstruction(instruction ins)
return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX_INSTRUCTION);
}

bool emitter::IsKInstruction(instruction ins)
{
return (ins >= INS_FIRST_K_INSTRUCTION) && (ins <= INS_LAST_K_INSTRUCTION);
}

//------------------------------------------------------------------------
// IsAvx512OrPriorInstruction: Is this an Avx512 or Avx or Sse instruction.
// IsAvx512OrPriorInstruction: Is this an Avx512 or Avx or Sse or K (opmask) instruction.
// Technically, K instructions would be considered under the VEX encoding umbrella, but due to
// the instruction table encoding had to be pulled out with the rest of the `INST5` definitions.
tannergooding marked this conversation as resolved.
Show resolved Hide resolved
Comment on lines +44 to +45
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs to be updated slightly since KInstructions are part of the INST3 group now.

//
// Arguments:
// ins - The instruction to check.
Expand All @@ -46,7 +53,7 @@ bool emitter::IsSSEOrAVXInstruction(instruction ins)
bool emitter::IsAvx512OrPriorInstruction(instruction ins)
{
// TODO-XArch-AVX512: Fix check once AVX512 instructions are added.
return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION);
return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION)) || IsKInstruction(ins);
}

bool emitter::IsAVXOnlyInstruction(instruction ins)
Expand Down Expand Up @@ -154,7 +161,7 @@ regNumber emitter::getSseShiftRegNumber(instruction ins)

bool emitter::IsVexEncodedInstruction(instruction ins) const
{
return UseVEXEncoding() && IsSSEOrAVXInstruction(ins);
return UseVEXEncoding() && (IsSSEOrAVXInstruction(ins) || IsKInstruction(ins));
}

//------------------------------------------------------------------------
Expand Down Expand Up @@ -263,6 +270,11 @@ bool emitter::IsEvexEncodedInstruction(instruction ins) const
case INS_vbroadcastf128: // INS_vbroadcastf32x4, INS_vbroadcastf64x2.
case INS_vbroadcasti128: // INS_vbroadcasti32x4, INS_vbroadcasti64x2.

case INS_kmovb:
case INS_kmovw:
case INS_kmovd:
case INS_kmovq:

// TODO-XARCH-AVX512 these need to be encoded with the proper individual EVEX instructions (movdqu8,
// movdqu16 etc)
// For implementation speed, I have set it up so the standing instruction will default to the 32-bit operand
Expand Down Expand Up @@ -1248,6 +1260,8 @@ bool emitter::TakesRexWPrefix(instruction ins, emitAttr attr)
case INS_vpgatherqq:
case INS_vgatherdpd:
case INS_vgatherqpd:
case INS_vpmovw2m:
case INS_vpmovq2m:
return true;
default:
break;
Expand Down Expand Up @@ -1307,7 +1321,7 @@ bool emitter::TakesRexWPrefix(instruction ins, emitAttr attr)
// so we never need it
if ((ins != INS_push) && (ins != INS_pop) && (ins != INS_movq) && (ins != INS_movzx) && (ins != INS_push_hide) &&
(ins != INS_pop_hide) && (ins != INS_ret) && (ins != INS_call) && (ins != INS_tail_i_jmp) &&
!((ins >= INS_i_jmp) && (ins <= INS_l_jg)))
!((ins >= INS_i_jmp) && (ins <= INS_l_jg)) && (ins != INS_kmovb) && (ins != INS_kmovw) && (ins != INS_kmovd))
{
return true;
}
Expand Down Expand Up @@ -3477,7 +3491,16 @@ inline UNATIVE_OFFSET emitter::emitInsSizeRR(instrDesc* id)
// If Byte 4 (which is 0xFF00) is zero, that's where the RM encoding goes.
// Otherwise, it will be placed after the 4 byte encoding, making the total 5 bytes.
// This would probably be better expressed as a different format or something?
code_t code = insCodeRM(ins);
code_t code;
if (IsKInstruction(ins))
{
code = insCodeRR(ins);
code = AddVexPrefix(ins, code, EA_SIZE(id->idOpSize()));
}
else
{
code = insCodeRM(ins);
}

UNATIVE_OFFSET sz = emitGetAdjustedSize(id, code);

Expand Down Expand Up @@ -5850,6 +5873,10 @@ bool emitter::IsMovInstruction(instruction ins)
case INS_movupd:
case INS_movups:
case INS_movzx:
case INS_kmovb:
case INS_kmovw:
case INS_kmovd:
case INS_kmovq:
{
return true;
}
Expand Down Expand Up @@ -5971,6 +5998,15 @@ bool emitter::HasSideEffect(instruction ins, emitAttr size)
}
#endif // TARGET_AMD64

case INS_kmovb:
case INS_kmovw:
case INS_kmovd:
case INS_kmovq:
{
hasSideEffect = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be "false" for kmovq since its a 64-bit to 64-bit move and therefore doesn't zero extend, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is notably still unresolved

break;
}

default:
{
unreached();
Expand Down Expand Up @@ -6182,6 +6218,12 @@ void emitter::emitIns_Mov(instruction ins, emitAttr attr, regNumber dstReg, regN
}
#endif // TARGET_AMD64

case INS_kmovb:
case INS_kmovw:
case INS_kmovd:
case INS_kmovq:
break;

default:
{
unreached();
Expand Down Expand Up @@ -9578,6 +9620,11 @@ const char* emitter::emitRegName(regNumber reg, emitAttr attr, bool varName)
#ifdef TARGET_AMD64
char suffix = '\0';

if (isMaskReg(reg))
{
return emitKregName(reg);
tannergooding marked this conversation as resolved.
Show resolved Hide resolved
}

switch (EA_SIZE(attr))
{
case EA_64BYTE:
Expand Down Expand Up @@ -9798,6 +9845,24 @@ const char* emitter::emitZMMregName(unsigned reg)
return regNames[reg];
}

/*****************************************************************************
*
* Return a string that represents the given K register.
*/

const char* emitter::emitKregName(unsigned reg)
{
static const char* const regNames[] = {
#define REGDEF(name, rnum, mask, sname) sname,
#include "register.h"
};

assert(reg < REG_COUNT);
assert(reg < ArrLen(regNames));

return regNames[reg];
}

/*****************************************************************************
*
* Display a static data member reference.
Expand Down Expand Up @@ -13802,7 +13867,16 @@ BYTE* emitter::emitOutputRR(BYTE* dst, instrDesc* id)
{
assert((ins != INS_movd) || (isFloatReg(reg1) != isFloatReg(reg2)));

if ((ins != INS_movd) || isFloatReg(reg1))
if (IsKInstruction(ins))
{
code = insCodeRR(ins);
if (isGeneralRegister(reg1))
{
// kmov r, k form, flip last byte of opcode from 0x92 to 0x93
code |= 0x01;
}
}
else if ((ins != INS_movd) || isFloatReg(reg1))
{
code = insCodeRM(ins);
}
Expand Down Expand Up @@ -18103,6 +18177,27 @@ emitter::insExecutionCharacteristics emitter::getInsExecutionCharacteristics(ins
break;
}
#endif

case INS_vpmovb2m:
case INS_vpmovw2m:
case INS_vpmovd2m:
case INS_vpmovq2m:
{
result.insLatency += PERFSCORE_LATENCY_1C;
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
break;
}

case INS_kmovb:
case INS_kmovw:
case INS_kmovd:
case INS_kmovq:
{
result.insLatency += PERFSCORE_LATENCY_3C;
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
break;
}

default:
// unhandled instruction insFmt combination
perfScoreUnhandledInstruction(id, &result);
Expand Down
7 changes: 7 additions & 0 deletions src/coreclr/jit/emitxarch.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,11 @@ inline static bool isDoubleReg(regNumber reg)
return isFloatReg(reg);
}

inline static bool isMaskReg(regNumber reg)
{
return (reg >= REG_MASK_FIRST && reg <= REG_MASK_LAST);
}

/************************************************************************/
/* Routines that compute the size of / encode instructions */
/************************************************************************/
Expand Down Expand Up @@ -96,6 +101,7 @@ static bool IsAvx512OnlyInstruction(instruction ins);
static bool IsFMAInstruction(instruction ins);
static bool IsAVXVNNIInstruction(instruction ins);
static bool IsBMIInstruction(instruction ins);
static bool IsKInstruction(instruction ins);

static regNumber getBmiRegNumber(instruction ins);
static regNumber getSseShiftRegNumber(instruction ins);
Expand Down Expand Up @@ -670,6 +676,7 @@ void emitDispShift(instruction ins, int cnt = 0);
const char* emitXMMregName(unsigned reg);
const char* emitYMMregName(unsigned reg);
const char* emitZMMregName(unsigned reg);
const char* emitKregName(unsigned reg);

/************************************************************************/
/* Private members that deal with target-dependent instr. descriptors */
Expand Down
43 changes: 43 additions & 0 deletions src/coreclr/jit/hwintrinsiccodegenxarch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1671,6 +1671,49 @@ void CodeGen::genAvxFamilyIntrinsic(GenTreeHWIntrinsic* node)
break;
}

case NI_AVX512F_MoveMaskSpecial:
{
op1Reg = op1->GetRegNum();
regNumber maskReg = node->ExtractTempReg(RBM_ALLMASK);

instruction maskIns;
instruction kmovIns;

switch (baseType)
tannergooding marked this conversation as resolved.
Show resolved Hide resolved
{
case TYP_BYTE:
case TYP_UBYTE:
maskIns = INS_vpmovb2m;
kmovIns = INS_kmovq;
break;
case TYP_SHORT:
case TYP_USHORT:
maskIns = INS_vpmovw2m;
kmovIns = INS_kmovd;
break;
case TYP_INT:
case TYP_UINT:
case TYP_FLOAT:
maskIns = INS_vpmovd2m;
kmovIns = INS_kmovw;
break;
case TYP_DOUBLE:
case TYP_LONG:
case TYP_ULONG:
maskIns = INS_vpmovq2m;
kmovIns = INS_kmovb;
break;
default:
unreached();
}

assert(emitter::isMaskReg(maskReg) && maskReg == REG_K1);
tannergooding marked this conversation as resolved.
Show resolved Hide resolved

emit->emitIns_R_R(maskIns, attr, maskReg, op1Reg);
emit->emitIns_Mov(kmovIns, EA_8BYTE, targetReg, maskReg, INS_FLAGS_DONT_CARE);
break;
}

default:
unreached();
break;
Expand Down
4 changes: 4 additions & 0 deletions src/coreclr/jit/hwintrinsiclistxarch.h
Original file line number Diff line number Diff line change
Expand Up @@ -239,6 +239,8 @@ HARDWARE_INTRINSIC(Vector256, Xor,
HARDWARE_INTRINSIC(Vector512, Create, 64, -1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector512, get_Zero, 64, 0, {INS_xorps, INS_xorps, INS_xorps, INS_xorps, INS_xorps, INS_xorps, INS_xorps, INS_xorps, INS_xorps, INS_xorps}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_NoCodeGen|HW_Flag_ReturnsPerElementMask)

HARDWARE_INTRINSIC(Vector512, ExtractMostSignificantBits, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)

// ***************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
// ISA Function name SIMD size NumArg Instructions Category Flags
// {TYP_BYTE, TYP_UBYTE, TYP_SHORT, TYP_USHORT, TYP_INT, TYP_UINT, TYP_LONG, TYP_ULONG, TYP_FLOAT, TYP_DOUBLE}
Expand Down Expand Up @@ -871,6 +873,8 @@ HARDWARE_INTRINSIC(SSE2, UCOMISD,
HARDWARE_INTRINSIC(SSE41, PTEST, 16, 2, {INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoRMWSemantics|HW_Flag_NoEvexSemantics)
HARDWARE_INTRINSIC(AVX, PTEST, 0, 2, {INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_ptest, INS_vtestps, INS_vtestpd}, HW_Category_SimpleSIMD, HW_Flag_NoRMWSemantics|HW_Flag_NoEvexSemantics)

HARDWARE_INTRINSIC(AVX512F, MoveMaskSpecial, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movd, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoContainment|HW_Flag_SpecialCodeGen|HW_Flag_NoRMWSemantics)
tannergooding marked this conversation as resolved.
Show resolved Hide resolved

#endif // FEATURE_HW_INTRINSIC

#undef HARDWARE_INTRINSIC
Expand Down
16 changes: 16 additions & 0 deletions src/coreclr/jit/hwintrinsicxarch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1249,6 +1249,22 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic intrinsic,
break;
}

case NI_Vector512_ExtractMostSignificantBits:
{
if (compOpportunisticallyDependsOn(InstructionSet_AVX512F) &&
compOpportunisticallyDependsOn(InstructionSet_AVX512BW) &&
compOpportunisticallyDependsOn(InstructionSet_AVX512DQ))
tannergooding marked this conversation as resolved.
Show resolved Hide resolved
{
var_types simdType = getSIMDTypeForSize(simdSize);

op1 = impSIMDPopStack(simdType);

retNode = gtNewSimdHWIntrinsicNode(retType, op1, NI_AVX512F_MoveMaskSpecial, simdBaseJitType, simdSize,
/* isSimdAsHWIntrinsic */ false);
}
break;
}

case NI_Vector128_ExtractMostSignificantBits:
case NI_Vector256_ExtractMostSignificantBits:
{
Expand Down
3 changes: 2 additions & 1 deletion src/coreclr/jit/instr.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,8 @@ const char* CodeGen::genInsDisplayName(emitter::instrDesc* id)
static char buf[4][TEMP_BUFFER_LEN];
const char* retbuf;

if (GetEmitter()->IsVexEncodedInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins))
if (GetEmitter()->IsVexEncodedInstruction(ins) && !GetEmitter()->IsBMIInstruction(ins) &&
!GetEmitter()->IsKInstruction(ins))
{
sprintf_s(buf[curBuf], TEMP_BUFFER_LEN, "v%s", insName);
retbuf = buf[curBuf];
Expand Down
17 changes: 17 additions & 0 deletions src/coreclr/jit/instrsxarch.h
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,18 @@ INST5(dec_l, "dec", IUM_RW, 0x0008FE, BAD_CODE,
// See comment around quarter way through this file for more information.
INST5(bswap, "bswap", IUM_RW, 0x0F00C8, BAD_CODE, BAD_CODE, BAD_CODE, 0x00C80F, INS_TT_NONE, INS_FLAGS_None )

INST5(FIRST_K_INSTRUCTION, "FIRST_K_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None )

// id nm um mr mi rm a4 rr tt flags
// TODO-XARCH-AVX512 add the proper W bit switch
INST5(kmovb, "kmovb", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, PACK3(0x66, 0x0F, 0x92), INS_TT_NONE, INS_FLAGS_Has_Wbit )
INST5(kmovw, "kmovw", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, PACK2(0x0F, 0x92), INS_TT_NONE, INS_FLAGS_Has_Wbit )
INST5(kmovd, "kmovd", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, PACK3(0xF2, 0x0F, 0x92), INS_TT_NONE, INS_FLAGS_Has_Wbit )
INST5(kmovq, "kmovq", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, PACK3(0xF2, 0x0F, 0x92), INS_TT_NONE, INS_FLAGS_Has_Wbit )

INST5(LAST_K_INSTRUCTION, "LAST_K_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None )


// id nm um mr mi rm a4 tt flags
INST4(add, "add", IUM_RW, 0x000000, 0x000080, 0x000002, 0x000004, INS_TT_NONE, Writes_OF | Writes_SF | Writes_ZF | Writes_AF | Writes_PF | Writes_CF | INS_FLAGS_Has_Sbit | INS_FLAGS_Has_Wbit )
INST4(or, "or", IUM_RW, 0x000008, 0x000880, 0x00000A, 0x00000C, INS_TT_NONE, Resets_OF | Writes_SF | Writes_ZF | Undefined_AF | Writes_PF | Resets_CF | INS_FLAGS_Has_Sbit | INS_FLAGS_Has_Wbit )
Expand Down Expand Up @@ -648,6 +660,11 @@ INST3(vinsertf32x8, "insertf32x8", IUM_WR, BAD_CODE, BAD_CODE,
INST3(vinserti32x8, "inserti32x8", IUM_WR, BAD_CODE, BAD_CODE, SSE3A(0x3A), INS_TT_TUPLE8, INS_Flags_IsDstDstSrcAVXInstruction) // Insert 256-bit packed quadword integer values
INST3(LAST_AVX512DQ_INSTRUCTION, "LAST_AVX512DQ_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None)

INST3(vpmovb2m, "vpmovb2m", IUM_WR, BAD_CODE, BAD_CODE, PACK4(0xF3, 0x0F, 0x38, 0x29), INS_TT_NONE, Input_8Bit)
INST3(vpmovw2m, "vpmovw2m", IUM_WR, BAD_CODE, BAD_CODE, PACK4(0xF3, 0x0F, 0x38, 0x29), INS_TT_NONE, Input_16Bit)
INST3(vpmovd2m, "vpmovd2m", IUM_WR, BAD_CODE, BAD_CODE, PACK4(0xF3, 0x0F, 0x38, 0x39), INS_TT_NONE, Input_32Bit)
INST3(vpmovq2m, "vpmovq2m", IUM_WR, BAD_CODE, BAD_CODE, PACK4(0xF3, 0x0F, 0x38, 0x39), INS_TT_NONE, Input_64Bit)

INST3(LAST_AVX512_INSTRUCTION, "LAST_AVX512_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None)

// Scalar instructions in SSE4.2
Expand Down
9 changes: 9 additions & 0 deletions src/coreclr/jit/lsra.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -698,6 +698,9 @@ LinearScan::LinearScan(Compiler* theCompiler)

availableFloatRegs = RBM_ALLFLOAT;
availableDoubleRegs = RBM_ALLDOUBLE;
#if defined(TARGET_XARCH)
availableMaskRegs = RBM_K1;
tannergooding marked this conversation as resolved.
Show resolved Hide resolved
#endif

#if defined(TARGET_AMD64) || defined(TARGET_ARM64)
if (compiler->opts.compDbgEnC)
Expand Down Expand Up @@ -737,6 +740,12 @@ LinearScan::LinearScan(Compiler* theCompiler)
{
availableRegs[i] = &availableDoubleRegs;
}
#ifdef TARGET_XARCH
else if (thisType == TYP_MASK)
tannergooding marked this conversation as resolved.
Show resolved Hide resolved
{
availableRegs[i] = &availableMaskRegs;
}
Comment on lines +748 to +751
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a general case like the other where integers are the most common, but they're at the bottom of the checked cases for the loop.

Presumably changing this to either a switch or to prioritize setting availableIntRegs as the first path would be beneficial so we don't have "one extra check" before hitting the path.


Actually, looking at this, this entire thing is just very unnecessarily expensive. We could track the classification in the typelist.h so we could do:

#define DEF_TP(tn, nm, jitType, verType, sz, sze, asze, st, al, tf, ar) availableRegs[static_cast<int>(TYP_##tn)] = &ar;
#include "typelist.h"
#undef DEF_TP
    TYP_COUNT

This would involve no loops, no complex checks as to type kind/etc.

@kunalspathak, thoughts? Probably not that impactful overall, since its just part of the LSRA constructor; but still seems "meaningful" given we have around 25 TYP_* defines and this is doing 4-5 checks to initialize 17 of them and 3-4 checks to initialize 6 others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- Not saying this needs to be handled in this PR, its something we can do as a follow up if its something we want to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated #83109 to take care of it.

#endif // TARGET_XARCH
#endif // FEATURE_SIMD
else
{
Expand Down
Loading