Merge branch 'dev'
herumi committed Oct 29, 2024
2 parents cf209c9 + 565ad4e commit 97b6611
Showing 15 changed files with 222 additions and 63 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,6 +1,6 @@
cmake_minimum_required(VERSION 3.5)

project(xbyak LANGUAGES CXX VERSION 7.20)
project(xbyak LANGUAGES CXX VERSION 7.20.1)

file(GLOB headers xbyak/*.h)

1 change: 1 addition & 0 deletions doc/changelog.md
@@ -1,5 +1,6 @@
# History

* 2024/Oct/17 ver 7.20.1 Updated to comply with AVX10.2 specification rev 2.0
* 2024/Oct/15 ver 7.20 Fixed the specification of setDefaultEncoding, setDefaultEncodingAVX10.
* 2024/Oct/15 ver 7.11 Added full support for AVX10.2
* 2024/Oct/13 ver 7.10 support AVX10 integer and fp16 vnni, media new instructions. setDefaultEncoding is extended.
30 changes: 18 additions & 12 deletions doc/usage.md
@@ -110,6 +110,15 @@ vfpclasspd k5{k3}, [rax+64]{1to2}, 5 --> vfpclasspd(k5|k3, xword_b [rax+64],
vfpclassps k5{k3}, [rax+64]{1to4}, 5 --> vfpclassps(k5|k3, yword_b [rax+64], 5); // broadcast 64-bit to 256-bit
```
### Remark
* `k1`, ..., `k7` are opmask registers.
- `k0` is treated as no mask.
- e.g. `vmovaps(zmm0|k0, ptr[rax]);` and `vmovaps(zmm0|T_z, ptr[rax]);` are the same as `vmovaps(zmm0, ptr[rax]);`.
* use `| T_z`, `| T_sae`, `| T_rn_sae`, `| T_rd_sae`, `| T_ru_sae`, `| T_rz_sae` instead of `,{z}`, `,{sae}`, `,{rn-sae}`, `,{rd-sae}`, `,{ru-sae}`, `,{rz-sae}` respectively.
* `k4 | k3` is different from `k3 | k4`.
* use `ptr_b` for broadcast `{1toX}`. X is automatically determined.
* specify `xword`/`yword`/`zword(_b)` for m128/m256/m512 if necessary.
## Selecting AVX512-VNNI, AVX-VNNI, AVX-VNNI-INT8, AVX10.2.
Some mnemonics have several possible encodings: VEX, EVEX, AVX10.2.
The functions for these mnemonics include an optional parameter as the last argument to specify the encoding.
@@ -145,20 +154,17 @@ feature|AVX512-VNNI|AVX-VNNI
-|-|-
feature|AVX-VNNI-INT8, AVX512-FP16|AVX10.2

- Target functions: vmpsadbw, vpdpbssd, vpdpbssds, vpdpbsud, vpdpbsuds, vpdpbuud, vpdpbuuds, vpdpwsud, vpdpwsuds, vpdpwusd, vpdpwusds, vpdpwuud, vpdpwuuds, vmovd, vmovw

- Remark: vmovd and vmovw have several kinds of encoding, such as AVX/AVX512F/AVX512-FP16/AVX10.2.
At first, I attempted to use EvexEncoding (resp. VexEncoding) instead of AVX10v2Encoding (resp. EvexEncoding) for `setDefaultEncodingAVX10`.
But I abandoned this idea when I found that `vmovd` and `vmovw` had different EVEX encodings in AVX512 and AVX10.2.
- Target functions: vmpsadbw, vpdpbssd, vpdpbssds, vpdpbsud, vpdpbsuds, vpdpbuud, vpdpbuuds, vpdpwsud, vpdpwsuds, vpdpwusd, vpdpwusds, vpdpwuud, vpdpwuuds, and vmovd, vmovw with a memory operand (XMM-to-MEM or MEM-to-XMM).

### Remark
* `k1`, ..., `k7` are opmask registers.
- `k0` is treated as no mask.
- e.g. `vmovaps(zmm0|k0, ptr[rax]);` and `vmovaps(zmm0|T_z, ptr[rax]);` are the same as `vmovaps(zmm0, ptr[rax]);`.
* use `| T_z`, `| T_sae`, `| T_rn_sae`, `| T_rd_sae`, `| T_ru_sae`, `| T_rz_sae` instead of `,{z}`, `,{sae}`, `,{rn-sae}`, `,{rd-sae}`, `,{ru-sae}`, `,{rz-sae}` respectively.
* `k4 | k3` is different from `k3 | k4`.
* use `ptr_b` for broadcast `{1toX}`. X is automatically determined.
* specify `xword`/`yword`/`zword(_b)` for m128/m256/m512 if necessary.

1. `vmovd` and `vmovw` instructions with REG-to-XMM or XMM-to-REG operands are always encoded using AVX10.1.
When used with XMM-to-XMM operands, these instructions are always encoded using AVX10.2.

2. `vmovd` and `vmovw` instructions with XMM-to-MEM or MEM-to-XMM operands support multiple encoding formats, including AVX, AVX512F, AVX512-FP16, and AVX10.2.

Initially, I tried implementing `setDefaultEncodingAVX10` using `EvexEncoding` (resp. `VexEncoding`) instead of `AVX10v2Encoding` (resp. `EvexEncoding`).
However, I abandoned this approach after discovering the complexity of the encoding requirements of `vmovd` and `vmovw`.

## APX
[Advanced Performance Extensions (APX) Architecture Specification](https://www.intel.com/content/www/us/en/content-details/786223/intel-advanced-performance-extensions-intel-apx-architecture-specification.html)
14 changes: 7 additions & 7 deletions gen/gen_avx512.cpp
@@ -202,13 +202,13 @@ void putX_XM()
{ 0x2F, "vcomish", T_MUST_EVEX | T_MAP5 | T_EW0 | T_SAE_X | T_N2 },
{ 0x2E, "vucomish", T_MUST_EVEX | T_MAP5 | T_EW0 | T_SAE_X | T_N2 },

{ 0x2F, "vcomxsd", T_MUST_EVEX | T_F3 | T_0F | T_EW1 | T_SAE_X | T_N8 },
{ 0x2F, "vcomxsh", T_MUST_EVEX | T_F2 | T_MAP5 | T_EW0 | T_SAE_X | T_N2 },
{ 0x2F, "vcomxss", T_MUST_EVEX | T_F2 | T_0F | T_EW0 | T_SAE_X | T_N4 },
{ 0x2F, "vcomxsd", T_MUST_EVEX | T_F2 | T_0F | T_EW1 | T_SAE_X | T_N8 },
{ 0x2F, "vcomxsh", T_MUST_EVEX | T_F3 | T_MAP5 | T_EW0 | T_SAE_X | T_N2 },
{ 0x2F, "vcomxss", T_MUST_EVEX | T_F3 | T_0F | T_EW0 | T_SAE_X | T_N4 },

{ 0x2E, "vucomxsd", T_MUST_EVEX | T_F3 | T_0F | T_EW1 | T_SAE_X | T_N8 },
{ 0x2E, "vucomxsh", T_MUST_EVEX | T_F2 | T_MAP5 | T_EW0 | T_SAE_X | T_N2 },
{ 0x2E, "vucomxss", T_MUST_EVEX | T_F2 | T_0F | T_EW0 | T_SAE_X | T_N4 },
{ 0x2E, "vucomxsd", T_MUST_EVEX | T_F2 | T_0F | T_EW1 | T_SAE_X | T_N8 },
{ 0x2E, "vucomxsh", T_MUST_EVEX | T_F3 | T_MAP5 | T_EW0 | T_SAE_X | T_N2 },
{ 0x2E, "vucomxss", T_MUST_EVEX | T_F3 | T_0F | T_EW0 | T_SAE_X | T_N4 },

// 13.1
{ 0x69, "vcvtnebf162ibs", T_MUST_EVEX | T_YMM | T_F2 | T_MAP5 | T_EW0 | T_B16 },
@@ -893,7 +893,7 @@ void putX_XM_IMM()
{ 0x62, "vpexpandw", T_66 | T_0F38 | T_YMM | T_MUST_EVEX | T_EW1 | T_SAE_Z | T_N2, false },

{ 0x2F, "vcomsbf16", T_MUST_EVEX | T_66 | T_MAP5 | T_EW0 | T_N2, false },
{ 0x42, "vgetexppbf16", T_MUST_EVEX | T_66 | T_MAP5 | T_EW0 | T_YMM | T_B16, false },
{ 0x42, "vgetexppbf16", T_MUST_EVEX | T_MAP6 | T_EW0 | T_YMM | T_B16, false },
{ 0x26, "vgetmantpbf16", T_MUST_EVEX | T_F2 | T_0F3A | T_EW0 | T_YMM | T_B16, true },
{ 0x4C, "vrcppbf16", T_MUST_EVEX | T_MAP6 | T_EW0 | T_YMM | T_B16, false },
{ 0x56, "vreducenepbf16", T_MUST_EVEX | T_F2 | T_0F3A | T_EW0 | T_YMM | T_B16, true },
2 changes: 1 addition & 1 deletion meson.build
@@ -5,7 +5,7 @@
project(
'xbyak',
'cpp',
version: '7.20',
version: '7.20.1',
license: 'BSD-3-Clause',
default_options: 'b_ndebug=if-release'
)
2 changes: 1 addition & 1 deletion readme.md
@@ -1,5 +1,5 @@

# Xbyak 7.20 [![Badge Build]][Build Status]
# Xbyak 7.20.1 [![Badge Build]][Build Status]

*A JIT assembler for x86/x64 architectures supporting advanced instruction sets up to AVX10.2*

3 changes: 2 additions & 1 deletion readme.txt
@@ -1,5 +1,5 @@

Xbyak 7.20, a JIT assembler for x86 (IA-32) and x64 (AMD64, x86-64) for C++
Xbyak 7.20.1, a JIT assembler for x86 (IA-32) and x64 (AMD64, x86-64) for C++

-----------------------------------------------------------------------------
◎ Overview
@@ -404,6 +404,7 @@ sample/{echo,hello}.bf are from http://www.kmonos.net/alang/etc/brainfuck.php
-----------------------------------------------------------------------------
◎ History

2024/10/17 ver 7.20.1 Updated to follow the changes in the AVX10.2 rev 2.0 specification
2024/10/15 ver 7.20 Finalized the specification of setDefaultEncoding/setDefaultEncodingAVX10
2024/10/15 ver 7.11 Full support for AVX10.2
2024/10/13 ver 7.10 Support for AVX10 integer and fp16 vnni, media new instructions. Extended setDefaultEncoding.
3 changes: 2 additions & 1 deletion test/Makefile
@@ -60,7 +60,8 @@ apx: apx.cpp $(XBYAK_INC)
avx10_test: avx10_test.cpp $(XBYAK_INC)
$(CXX) $(CFLAGS) avx10_test.cpp -o $@ -DXBYAK64

TEST_FILES=old.txt new-ymm.txt bf16.txt comp.txt misc.txt convert.txt minmax.txt saturation.txt
#TEST_FILES=old.txt new-ymm.txt bf16.txt comp.txt misc.txt convert.txt minmax.txt saturation.txt
TEST_FILES=old.txt new-ymm.txt bf16.txt misc.txt convert.txt minmax.txt saturation.txt
xed_test:
@set -e; \
for target in $(addprefix avx10/, $(TEST_FILES)); do \
18 changes: 9 additions & 9 deletions test/avx10/bf16.txt
@@ -113,17 +113,17 @@ vfpclasspbf16(k7|k5, zword_b[rax+128], 13);
vcomsbf16(xm2, xm3);
vcomsbf16(xm2, ptr[rax+128]);

vgetexppbf16(xm1|k3, xmm2);
vgetexppbf16(xm1|k3, ptr[rax+128]);
vgetexppbf16(xm1|k3, ptr_b[rax+128]);
//vgetexppbf16(xm1|k3, xmm2);
//vgetexppbf16(xm1|k3, ptr[rax+128]);
//vgetexppbf16(xm1|k3, ptr_b[rax+128]);

vgetexppbf16(ym1|k3, ymm2);
vgetexppbf16(ym1|k3, ptr[rax+128]);
vgetexppbf16(ym1|k3, ptr_b[rax+128]);
//vgetexppbf16(ym1|k3, ymm2);
//vgetexppbf16(ym1|k3, ptr[rax+128]);
//vgetexppbf16(ym1|k3, ptr_b[rax+128]);

vgetexppbf16(zm1|k3, zmm2);
vgetexppbf16(zm1|k3, ptr[rax+128]);
vgetexppbf16(zm1|k3, ptr_b[rax+128]);
//vgetexppbf16(zm1|k3, zmm2);
//vgetexppbf16(zm1|k3, ptr[rax+128]);
//vgetexppbf16(zm1|k3, ptr_b[rax+128]);

vgetmantpbf16(xm1|k3, xmm2, 3);
vgetmantpbf16(xm1|k3, ptr[rax+128], 5);
96 changes: 96 additions & 0 deletions test/misc.cpp
@@ -2284,4 +2284,100 @@ CYBOZU_TEST_AUTO(avx_vnni_int)
CYBOZU_TEST_EQUAL_ARRAY(c.getCode(), tbl, n);
}

CYBOZU_TEST_AUTO(vmovd)
{
struct Code : Xbyak::CodeGenerator {
Code()
{
setDefaultEncodingAVX10(PreAVX10v2Encoding);
vmovd(eax, xm1); // always AVX10.1
vmovd(xm1, eax); // always AVX10.1
vmovd(xm3, xm1); // always AVX10.2
// AVX-512 (AVX10.1)
vmovd(ptr[rax+128], xm1);
vmovd(xm1, ptr[rax+128]);
vmovd(ptr[rax+128], xm30);
vmovd(xm30, ptr[rax+128]);

setDefaultEncodingAVX10(AVX10v2Encoding);
vmovd(eax, xm1); // always AVX10.1
vmovd(xm1, eax); // always AVX10.1
vmovd(xm3, xm1); // always AVX10.2
// AVX10.2
vmovd(ptr[rax+128], xm1);
vmovd(xm1, ptr[rax+128]);
vmovd(ptr[rax+128], xm30);
vmovd(xm30, ptr[rax+128]);
}
} c;
const uint8_t tbl[] = {
0xc5, 0xf9, 0x7e, 0xc8, // avx10.1
0xc5, 0xf9, 0x6e, 0xc8, // avx10.1
0x62, 0xf1, 0x7e, 0x08, 0x7e, 0xd9, // avx10.2
0xc5, 0xf9, 0x7e, 0x88, 0x80, 0x00, 0x00, 0x00, // avx
0xc5, 0xf9, 0x6e, 0x88, 0x80, 0x00, 0x00, 0x00, // avx
0x62, 0x61, 0x7d, 0x08, 0x7e, 0x70, 0x20, // avx10.1
0x62, 0x61, 0x7d, 0x08, 0x6e, 0x70, 0x20, // avx10.1

0xc5, 0xf9, 0x7e, 0xc8, // avx10.1
0xc5, 0xf9, 0x6e, 0xc8, // avx10.1
0x62, 0xf1, 0x7e, 0x08, 0x7e, 0xd9, // avx10.2
0x62, 0xf1, 0x7d, 0x08, 0xd6, 0x48, 0x20, // avx10.2
0x62, 0xf1, 0x7e, 0x08, 0x7e, 0x48, 0x20, // avx10.2
0x62, 0x61, 0x7d, 0x08, 0xd6, 0x70, 0x20, // avx10.2
0x62, 0x61, 0x7e, 0x08, 0x7e, 0x70, 0x20, // avx10.2
};
const size_t n = sizeof(tbl) / sizeof(tbl[0]);
CYBOZU_TEST_EQUAL(c.getSize(), n);
CYBOZU_TEST_EQUAL_ARRAY(c.getCode(), tbl, n);
}

CYBOZU_TEST_AUTO(vmovw)
{
struct Code : Xbyak::CodeGenerator {
Code()
{
setDefaultEncodingAVX10(PreAVX10v2Encoding);
vmovw(eax, xm1); // always avx10.1
vmovw(xm1, eax); // always avx10.1
vmovw(xm3, xm1); // always avx10.2
// AVX10.1
vmovw(ptr[rax+128], xm1);
vmovw(xm1, ptr[rax+128]);
vmovw(ptr[rax+128], xm30);
vmovw(xm30, ptr[rax+128]);

setDefaultEncodingAVX10(AVX10v2Encoding);
vmovw(eax, xm1); // always avx10.1
vmovw(xm1, eax); // always avx10.1
vmovw(xm3, xm1); // always avx10.2
// AVX10.2
vmovw(ptr[rax+128], xm1);
vmovw(xm1, ptr[rax+128]);
vmovw(ptr[rax+128], xm30);
vmovw(xm30, ptr[rax+128]);
}
} c;
const uint8_t tbl[] = {
0x62, 0xf5, 0x7d, 0x08, 0x7e, 0xc8,
0x62, 0xf5, 0x7d, 0x08, 0x6e, 0xc8,
0x62, 0xf5, 0x7e, 0x08, 0x6e, 0xd9,
0x62, 0xf5, 0x7d, 0x08, 0x7e, 0x48, 0x40,
0x62, 0xf5, 0x7d, 0x08, 0x6e, 0x48, 0x40,
0x62, 0x65, 0x7d, 0x08, 0x7e, 0x70, 0x40,
0x62, 0x65, 0x7d, 0x08, 0x6e, 0x70, 0x40,

0x62, 0xf5, 0x7d, 0x08, 0x7e, 0xc8,
0x62, 0xf5, 0x7d, 0x08, 0x6e, 0xc8,
0x62, 0xf5, 0x7e, 0x08, 0x6e, 0xd9,
0x62, 0xf5, 0x7e, 0x08, 0x7e, 0x48, 0x40,
0x62, 0xf5, 0x7e, 0x08, 0x6e, 0x48, 0x40,
0x62, 0x65, 0x7e, 0x08, 0x7e, 0x70, 0x40,
0x62, 0x65, 0x7e, 0x08, 0x6e, 0x70, 0x40,
};
const size_t n = sizeof(tbl) / sizeof(tbl[0]);
CYBOZU_TEST_EQUAL(c.getSize(), n);
CYBOZU_TEST_EQUAL_ARRAY(c.getCode(), tbl, n);
}

#endif
6 changes: 6 additions & 0 deletions test/test_by_xed.bat
@@ -0,0 +1,6 @@
@echo off
set CFLAGS=-I ../ /EHsc /nologo
copy %1% tmp.cpp
cl %CFLAGS% test_by_xed.cpp && test_by_xed.exe
%XED% -64 -ir bin > out.txt
python3 test_by_xed.py %1% out.txt