GH-44513: [C++] Fix overflow issues for large build side in swiss join #45108
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
Hi @pitrou, would you help take a look? Thanks.
```cpp
// number of bits to right shift, rather than branching on whether log_blocks_ > 25
// every time in tight loops.
int bits_shift_for_block_and_stamp_ = bits_hash_ - log_blocks_ - bits_stamp_;
int bits_shift_for_block_ = bits_stamp_;
```
Since the computation is repeated several times, perhaps we can have a short helper function to factor it out? Something like:
```cpp
static std::pair<int, int> ComputeBitShifts(int log_blocks) {
  if (log_blocks + bits_stamp_ > bits_hash_) {
    return {0, bits_hash_ - log_blocks};
  } else {
    return {bits_hash_ - log_blocks - bits_stamp_, bits_stamp_};
  }
}
```
Right. Done.
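For readers skimming the thread, here is a minimal self-contained sketch of how the suggested helper can feed the two new members. The `SwissTableSketch` wrapper and `SetLogBlocks` name are made up for illustration; the committed code lives inside the real `SwissTable` class and may differ in detail.

```cpp
#include <tuple>
#include <utility>

// Constants mirroring the members referenced in the diff above.
constexpr int bits_hash_ = 32;
constexpr int bits_stamp_ = 7;

static std::pair<int, int> ComputeBitShifts(int log_blocks) {
  if (log_blocks + bits_stamp_ > bits_hash_) {
    // Stamp and block id overlap (log_blocks > 25): don't shift the hash when
    // taking the stamp, and shift by the remaining width to get the block id.
    return {0, bits_hash_ - log_blocks};
  }
  return {bits_hash_ - log_blocks - bits_stamp_, bits_stamp_};
}

struct SwissTableSketch {
  int log_blocks_ = 0;
  int bits_shift_for_block_and_stamp_ = bits_hash_ - bits_stamp_;
  int bits_shift_for_block_ = bits_stamp_;

  // Refresh both precomputed shifts whenever log_blocks_ changes, so tight
  // loops never branch on log_blocks_ > 25.
  void SetLogBlocks(int log_blocks) {
    log_blocks_ = log_blocks;
    std::tie(bits_shift_for_block_and_stamp_, bits_shift_for_block_) =
        ComputeBitShifts(log_blocks_);
  }
};
```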
```cpp
// This is to prevent index overflow issues in GH-44513.
// NB: Use zero-extend conversion for unsigned hash.
__m256i hash_lo = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(hash));
__m256i hash_hi = _mm256_cvtepu32_epi64(_mm256_extracti128_si256(hash, 1));
__m256i local_slot =
    _mm256_set1_epi64x(reinterpret_cast<const uint64_t*>(local_slots)[i]);
local_slot = _mm256_shuffle_epi8(
```
Hmm... so this first uses `_mm256_shuffle_epi8` to expand from 8-bit to 32-bit lanes, and then `_mm256_cvtepi32_epi64` below expands it from 32-bit to 64-bit lanes? Would it be quicker to shuffle directly from 8-bit to 64-bit (twice, I suppose)?

(Interestingly, `_mm256_shuffle_epi8` is faster than `_mm256_cvtepi32_epi64` according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_shuffle_epi8&ig_expand=1798,6006,1628,6006)
I was thinking that we could save one multiply of `local_offset * byte_size`. But yeah, once we have shuffled to 64-bit lanes, we can use `_mm256_mul_epi32` (5 cycles) to replace `_mm256_mullo_epi32` (10 cycles). Then we have 2 `_mm256_shuffle_epi8`s (1 cycle each) + 2 `_mm256_mul_epi32`s = 12 cycles in total, vs. 1 `_mm256_shuffle_epi8` + 1 `_mm256_mullo_epi32` + 2 `_mm256_cvtepi32_epi64`s (3 cycles each) = 17 cycles in total, which is still a win.

I've updated. Thank you for this.
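For anyone following along, here is a hedged, self-contained sketch of the byte-to-64-bit-lane expansion being discussed. The mask constants, function name, and `local_slots` pointer are illustrative, not the exact code in the PR.

```cpp
#include <immintrin.h>
#include <cstdint>

// Broadcast 8 one-byte local slots into every 64-bit lane, then use two
// per-lane byte shuffles to zero-extend them straight into 64-bit lanes.
inline void ExpandBytesTo64BitLanes(const uint8_t* local_slots, __m256i* lo,
                                    __m256i* hi) {
  // Bytes s0..s7 replicated into both 128-bit halves of the register.
  __m256i bytes =
      _mm256_set1_epi64x(*reinterpret_cast<const uint64_t*>(local_slots));
  // _mm256_shuffle_epi8 shuffles each 128-bit half independently; a mask byte
  // of -1 writes zero, which gives the zero-extension for free.
  const __m256i kLoMask = _mm256_setr_epi8(
      0, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1,
      2, -1, -1, -1, -1, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1);
  const __m256i kHiMask = _mm256_setr_epi8(
      4, -1, -1, -1, -1, -1, -1, -1, 5, -1, -1, -1, -1, -1, -1, -1,
      6, -1, -1, -1, -1, -1, -1, -1, 7, -1, -1, -1, -1, -1, -1, -1);
  *lo = _mm256_shuffle_epi8(bytes, kLoMask);  // {s0, s1, s2, s3} as uint64
  *hi = _mm256_shuffle_epi8(bytes, kHiMask);  // {s4, s5, s6, s7} as uint64
}
```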
```cpp
__m256i local_slot_hi =
    _mm256_cvtepi32_epi64(_mm256_extracti128_si256(local_slot, 1));
__m256i pos_lo =
    _mm256_srlv_epi64(hash_lo, _mm256_set1_epi64x(bits_hash_ - log_blocks_));
```
By the way, why not `_mm256_srli_epi64(hash_lo, bits_hash_ - log_blocks_)`?
Just copied from the original code, plus I wasn't aware of `_mm256_srli_epi64` then - still learning :)

Updated here and in a couple of other places that had unnecessary vector shifts. Thank you!
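For readers comparing the two intrinsics, a small sketch of the equivalence being relied on here. Values are made up, and it assumes a compiler that accepts a runtime count for `_mm256_srli_epi64` (GCC and Clang fall back to the register-count form in that case).

```cpp
#include <immintrin.h>
#include <cstdint>

int main() {
  __m256i hash_lo = _mm256_set1_epi64x(0x00000000DEADBEEFULL);
  int shift = 32 - 26;  // bits_hash_ - log_blocks_ in the PR's terminology

  // Per-lane variable counts: only needed when lanes shift by different amounts.
  __m256i a = _mm256_srlv_epi64(hash_lo, _mm256_set1_epi64x(shift));
  // Single count applied to all four lanes: sufficient here and cheaper to set up.
  __m256i b = _mm256_srli_epi64(hash_lo, shift);

  (void)a;
  (void)b;
  return 0;
}
```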
```cpp
pos_lo = _mm256_mul_epi32(pos_lo, _mm256_set1_epi32(byte_multiplier));
pos_hi = _mm256_mul_epi32(pos_hi, _mm256_set1_epi32(byte_multiplier));
```
For the record, why are we multiplying in the signed domain rather than unsigned?
Yeah, we should use unsigned multiply. But actually they are the same in this specific case (i.e., both operands are less than `0x80000000` - note that `log_blocks_` is strictly less than `32`). Even if the result is larger than `uint32_max`, `_mm256_mul_epi32` won't do sign-extension.

Anyway, I'll update. Thank you.
Done.
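To make the signed-vs-unsigned point above concrete, a small hedged sketch; the operand values are illustrative (a 26-bit block id times 40 bytes per block, as in the rationale below).

```cpp
#include <immintrin.h>
#include <cstdint>

int main() {
  // Both intrinsics multiply the low 32 bits of each 64-bit lane into a 64-bit
  // product; _epi32 sign-extends those 32 bits, _epu32 zero-extends them. With
  // both operands below 0x80000000 the two results are identical, even though
  // the product here (0x9FFFFFD8) no longer fits in a signed 32-bit integer.
  __m256i block_id = _mm256_set1_epi64x(0x03FFFFFF);  // largest 26-bit block id
  __m256i bytes_per_block = _mm256_set1_epi64x(40);
  __m256i signed_prod = _mm256_mul_epi32(block_id, bytes_per_block);
  __m256i unsigned_prod = _mm256_mul_epu32(block_id, bytes_per_block);
  (void)signed_prod;
  (void)unsigned_prod;
  return 0;
}
```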
@ursabot please benchmark
Commit 4462ceb already has scheduled benchmark runs.
Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit 4462ceb. There were 29 benchmark results with an error.

There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
Rationale for this change
#44513 triggers two distinct overflow issues within swiss join, both happening when the build side table contains a large enough number of rows or distinct keys. (Cases of this size on the hash join build side are rather rare, so we haven't seen them reported until now.)

1. The first `N` bits of the 32-bit hash value are used as the index into a buffer storing "block"s (a block contains `8` key ids - in some code also referred to as "group" ids). This `N`-bit number is further multiplied by the size of a block, which is also related to `N`. The `N` in the case of #44513 is `26` and a block takes `40` bytes, so the multiplication can produce a number over `1 << 31` (negative when interpreted as signed 32-bit) in our AVX2 specialization of accessing the block buffer (`arrow/cpp/src/arrow/compute/key_map_internal_avx2.cc`, line 404 at `0a00e25`).

2. The `7` bits of the 32-bit hash value after the first `N` bits are used as a "stamp" (to quickly fail the hash comparison). But when `N` is greater than `25`, arithmetic like that at `arrow/cpp/src/arrow/compute/key_map_internal.cc`, line 397 at `0a00e25` (where `bits_hash_` is `constexpr 32`, `log_blocks_` is `N`, and `bits_stamp_` is `constexpr 7`; this is the code that retrieves the stamp from a hash) produces `hash >> -1`, aka `hash >> 0xFFFFFFFF`, aka `hash >> 31` (the leading `1`s are trimmed). The stamp value is then wrong and results in falsely mismatched rows. This is the reason for my false positive run in #44513 (comment).
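To make the shift arithmetic in item 2 concrete, here is a scalar sketch of the failure mode. The values and types are illustrative; the real code is vectorized and lives in `key_map_internal.cc`.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  constexpr uint32_t bits_hash = 32;
  constexpr uint32_t bits_stamp = 7;
  uint32_t log_blocks = 26;  // the N from GH-44513
  uint32_t hash = 0xDEADBEEFu;

  // Intended: right-shift so that the 7 stamp bits following the first N bits
  // land at the bottom. With N = 26 the count is 32 - 26 - 7 = -1, which wraps
  // to 0xFFFFFFFF as an unsigned count; hardware masks the count to its low 5
  // bits, so the shift behaves like hash >> 31 and the "stamp" collapses to
  // the single top bit of the hash.
  uint32_t count = bits_hash - log_blocks - bits_stamp;  // wraps to 0xFFFFFFFF
  uint32_t stamp = hash >> (count & 31u);                // effectively hash >> 31
  std::cout << std::hex << stamp << "\n";                // prints 1, not a 7-bit stamp
  return 0;
}
```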
What changes are included in this PR?

For issue 1, use a 64-bit index gather intrinsic to avoid the offset overflow (a hedged sketch follows).
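A minimal illustration of the idea, assuming hypothetical names (`blocks`, `byte_offsets`); it is not the exact code in `key_map_internal_avx2.cc`.

```cpp
#include <immintrin.h>

// Gather four 32-bit values using 64-bit byte offsets, so that
// block_id * bytes_per_block can exceed 1 << 31 without wrapping negative.
// The 32-bit-index variant (_mm256_i32gather_epi32) would misread memory once
// an offset reaches 1 << 31, because its offsets are interpreted as signed.
inline __m128i GatherWith64BitOffsets(const int* blocks, __m256i byte_offsets) {
  return _mm256_i64gather_epi32(blocks, byte_offsets, 1);
}
```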
For issue 2, do not right-shift the hash if `N + 7 >= 32`. This actually allows the bits to overlap between the block id (the `N` bits) and the stamp (the `7` bits). Though this may introduce more false-positive hash comparisons (and thus worsen performance), I think it is still more reasonable than brutally failing for `N > 25`. I introduce two members, `bits_shift_for_block_and_stamp_` and `bits_shift_for_block_`, which are derived from `log_blocks_` - in particular, set to `0` and `32 - N` when `N + 7 >= 32` - to avoid branching like `if (log_blocks_ + bits_stamp_ > bits_hash_)` in tight loops.

Are these changes tested?
The fix is manually tested with the original case on my local machine. (I do have a concrete C++ unit test to verify the fix, but it requires too many resources and runs for too long, so it is impractical to run in any reasonable CI environment.)
Are there any user-facing changes?
None.