
GH-44513: [C++] Fix overflow issues for large build side in swiss join #45108

Open · wants to merge 24 commits into base: main

Conversation

@zanmato1984 zanmato1984 (Contributor) commented Dec 25, 2024

Rationale for this change

#44513 triggers two distinct overflow issues within the swiss join, both happening when the build side table contains a large enough number of rows or distinct keys. (Hash join build sides of this size are rather rare, so we haven't seen these issues reported until now):

  1. The first issue is that our swiss table implementation takes the higher N bits of the 32-bit hash value as the index into a buffer storing "block"s (a block contains 8 key ids; keys are also referred to as "groups" in some of the code). This N-bit index is further multiplied by the size of a block, which also depends on N. In [C++][Python] Pyarrow.Table.join() breaks on large tables v.18.0.0.dev486 #44513, N is 26 and a block takes 40 bytes, so the multiplication can produce a number over 1 << 31 (negative when interpreted as signed 32-bit). In our AVX2 specialization for accessing the block buffer,
     __m256i group_id = _mm256_i32gather_epi32(elements, pos, 1);
     an issue like [R] Segfault when collecting parquet dataset query results #41813 (comment) shows up. This is the issue that directly produced the segfault in #44513.
  2. The other issue is that we take the 7 bits of the 32-bit hash value after those N bits as a "stamp" (used to quickly fail hash comparisons). But when N is greater than 25, arithmetic like
     static_cast<int>((hash >> (bits_hash_ - log_blocks_ - bits_stamp_)) & stamp_mask);
     (bits_hash_ is constexpr 32, log_blocks_ is N, bits_stamp_ is constexpr 7; this retrieves the stamp from a hash) produces hash >> -1, aka hash >> 0xFFFFFFFF, aka hash >> 31 (the leading 1s are trimmed). The stamp value is then wrong and results in falsely mismatched rows. This is the reason for my false-positive run in #44513 (comment). (A standalone sketch illustrating both overflow modes follows this list.)
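For illustration, here is a minimal standalone sketch (not Arrow code; the constants follow the description above) showing how the block byte offset exceeds a signed 32-bit index for N = 26, and how the stamp shift count goes negative:

  #include <cstdint>
  #include <iostream>

  int main() {
    // Issue 1: the maximum block byte offset does not fit in a signed 32-bit
    // gather index.
    constexpr int log_blocks = 26;       // "N" from the description above
    constexpr int64_t block_bytes = 40;  // block size per the description above
    const int64_t max_block_id = (int64_t{1} << log_blocks) - 1;
    const int64_t max_offset = max_block_id * block_bytes;  // 2684354520
    std::cout << "max offset " << max_offset
              << (max_offset > INT32_MAX ? " overflows" : " fits")
              << " a signed 32-bit index\n";

    // Issue 2: the stamp shift count goes negative once N + 7 > 32.
    constexpr int bits_hash = 32, bits_stamp = 7;
    const int stamp_shift = bits_hash - log_blocks - bits_stamp;  // -1
    std::cout << "stamp shift count: " << stamp_shift
              << " (a negative shift count is invalid; on x86 it ends up as 31)\n";
    return 0;
  }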

What changes are included in this PR?

For issue 1, use the 64-bit index gather intrinsic to avoid the offset overflow.
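A hedged sketch of the approach (names are illustrative, not the PR's exact code): _mm256_i64gather_epi32 takes 64-bit indices, so byte offsets above INT32_MAX no longer wrap to negative values; two four-lane gathers replace the single 32-bit-index gather.

  #include <immintrin.h>

  // Illustrative sketch: gather 8 x 32-bit group ids addressed by byte offsets
  // that may exceed INT32_MAX. Names (elements, pos64_lo, pos64_hi) are mine.
  static inline __m256i GatherGroupIds64(const int* elements, __m256i pos64_lo,
                                         __m256i pos64_hi) {
    // Each 64-bit-index gather handles 4 lanes and yields a 128-bit result.
    __m128i lo = _mm256_i64gather_epi32(elements, pos64_lo, 1);
    __m128i hi = _mm256_i64gather_epi32(elements, pos64_hi, 1);
    // Recombine into a single vector of 8 x 32-bit group ids.
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
  }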

For issue 2, do not right-shift the hash when N + 7 >= 32. This effectively allows the bits of the block id (the N bits) and the stamp (the 7 bits) to overlap. Though this may introduce more false-positive hash comparisons (and thus hurt performance), I think it is still more reasonable than failing outright for N > 25. I introduce two members, bits_shift_for_block_and_stamp_ and bits_shift_for_block_, which are derived from log_blocks_ - in particular, set to 0 and 32 - N respectively when N + 7 >= 32 - so that tight loops avoid branching like if (log_blocks_ + bits_stamp_ > bits_hash_).
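A scalar sketch of how the two derived shifts could be used (member names follow the PR description; the extraction code itself is illustrative, not the exact Arrow code):

  #include <cstdint>

  // A scalar model of the shift derivation described above. Constants mirror
  // the PR text: bits_hash_ = 32, bits_stamp_ = 7, log_blocks_ = N.
  struct BlockStampShifts {
    static constexpr int bits_hash_ = 32;
    static constexpr int bits_stamp_ = 7;
    int log_blocks_;
    int bits_shift_for_block_and_stamp_;
    int bits_shift_for_block_;

    explicit BlockStampShifts(int log_blocks) : log_blocks_(log_blocks) {
      if (log_blocks_ + bits_stamp_ >= bits_hash_) {
        // N + 7 >= 32: don't pre-shift; the stamp comes from the low 7 bits
        // and may overlap the low bits of the block id.
        bits_shift_for_block_and_stamp_ = 0;
        bits_shift_for_block_ = bits_hash_ - log_blocks_;
      } else {
        bits_shift_for_block_and_stamp_ = bits_hash_ - log_blocks_ - bits_stamp_;
        bits_shift_for_block_ = bits_stamp_;
      }
    }

    // Extract the block id (top N bits of the hash) and the 7-bit stamp
    // without branching on log_blocks_ at the call site.
    void Extract(uint32_t hash, uint32_t* block_id, uint32_t* stamp) const {
      constexpr uint32_t stamp_mask = (1u << bits_stamp_) - 1;
      uint32_t block_and_stamp = hash >> bits_shift_for_block_and_stamp_;
      *stamp = block_and_stamp & stamp_mask;
      *block_id = block_and_stamp >> bits_shift_for_block_;
    }
  };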

Are these changes tested?

The fix was manually tested locally with the original case. (I do have a concrete C++ unit test that verifies the fix, but it requires too many resources and runs for too long, so it is impractical for any reasonable CI environment.)

Are there any user-facing changes?

None.

@zanmato1984 zanmato1984 marked this pull request as draft December 25, 2024 05:37

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984 zanmato1984 changed the title [C++] Fix for large build side in swiss join GH-44513: [C++] Fix for large build side in swiss join Dec 25, 2024

⚠️ GitHub issue #44513 has been automatically assigned in GitHub to PR creator.

@zanmato1984 zanmato1984 changed the title GH-44513: [C++] Fix for large build side in swiss join GH-44513: [C++] Fix overflow issues for large build side in swiss join Dec 26, 2024
@zanmato1984 zanmato1984 added the Critical Fix (Bugfixes for security vulnerabilities, crashes, or invalid data) label Dec 26, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review December 27, 2024 11:27
@zanmato1984 zanmato1984 (Contributor Author)

Hi @pitrou, would you help take a look? Thanks.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Dec 27, 2024
// number of bits to right shift, rather than branching on whether log_blocks_ > 25
// every time in tight loops.
int bits_shift_for_block_and_stamp_ = bits_hash_ - log_blocks_ - bits_stamp_;
int bits_shift_for_block_ = bits_stamp_;
Member


Since the computation is repeated several times, perhaps we can have a short helper function to factor it out? Something like:

  static std::pair<int, int> ComputeBitShifts(int log_blocks) {
    if (log_blocks + bits_stamp_ > bits_hash_) {
      return {0, bits_hash_ - log_blocks};
    } else {
      return {bits_hash_ - log_blocks - bits_stamp_, bits_stamp_};
    }
  }

Contributor Author


Right. Done.

// This is to prevent index overflow issues in GH-44513.
// NB: Use zero-extend conversion for unsigned hash.
__m256i hash_lo = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(hash));
__m256i hash_hi = _mm256_cvtepu32_epi64(_mm256_extracti128_si256(hash, 1));
__m256i local_slot =
    _mm256_set1_epi64x(reinterpret_cast<const uint64_t*>(local_slots)[i]);
local_slot = _mm256_shuffle_epi8(
Member


Hmm... so this first expands from 8-bit to 32-bit lanes with _mm256_shuffle_epi8, and then _mm256_cvtepi32_epi64 below expands from 32-bit to 64-bit lanes? Would it be quicker to shuffle directly from 8-bit to 64-bit (twice, I suppose)?

(interestingly, _mm256_shuffle_epi8 is faster than _mm256_cvtepi32_epi64 according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_shuffle_epi8&ig_expand=1798,6006,1628,6006)

Contributor Author


I was thinking that we could save one multiply of local_offset * byte_size. But yeah, once we have shuffled to 64-bit lanes, we can use _mm256_mul_epi32 (5 cycles) to replace _mm256_mullo_epi32 (10 cycles). Then we have 2 _mm256_shuffle_epi8s (1 cycle each) + 2 _mm256_mul_epi32s = 12 cycles in total, vs. 1 _mm256_shuffle_epi8 + 1 _mm256_mullo_epi32 + 2 _mm256_cvtepi32_epi64 (3 cycles each) = 17 cycles in total, which is still a win.

I've updated. Thank you for this.
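For reference, a standalone illustration of the "shuffle directly from 8-bit to 64-bit lanes, twice" idea (the shuffle controls and names here are illustrative, not the PR's actual code):

  #include <immintrin.h>
  #include <cstdint>

  // Expand 8 packed uint8_t slot ids into two vectors of 4 x 64-bit lanes each.
  static inline void ExpandSlots8To64(const uint8_t* local_slots, __m256i* lo,
                                      __m256i* hi) {
    // Broadcast the 8 slot bytes into every 64-bit element.
    __m256i slots = _mm256_set1_epi64x(
        *reinterpret_cast<const uint64_t*>(local_slots));
    // Per-128-bit-lane shuffle controls; a control byte with its high bit set
    // (here -1) zeroes the destination byte.
    const __m256i kLoMask = _mm256_setr_epi8(
        0, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1,
        2, -1, -1, -1, -1, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1);
    const __m256i kHiMask = _mm256_setr_epi8(
        4, -1, -1, -1, -1, -1, -1, -1, 5, -1, -1, -1, -1, -1, -1, -1,
        6, -1, -1, -1, -1, -1, -1, -1, 7, -1, -1, -1, -1, -1, -1, -1);
    *lo = _mm256_shuffle_epi8(slots, kLoMask);  // lanes: slot0..slot3
    *hi = _mm256_shuffle_epi8(slots, kHiMask);  // lanes: slot4..slot7
  }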

__m256i local_slot_hi =
    _mm256_cvtepi32_epi64(_mm256_extracti128_si256(local_slot, 1));
__m256i pos_lo =
    _mm256_srlv_epi64(hash_lo, _mm256_set1_epi64x(bits_hash_ - log_blocks_));
Member


By the way, why not _mm256_srli_epi64(hash_lo, bits_hash_ - log_blocks_)?

Contributor Author


I just copied it from the original code; plus, I wasn't aware of _mm256_srli_epi64 at the time - still learning :)

Updated here and in a couple of other places with unnecessary vector shifts. Thank you!
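For reference, a minimal standalone comparison of the two intrinsics (not PR code): _mm256_srli_epi64 shifts every 64-bit lane by the same count, which suffices here since all lanes use bits_hash_ - log_blocks_, while _mm256_srlv_epi64 allows a per-lane count.

  #include <immintrin.h>
  #include <cstdint>
  #include <cstdio>

  int main() {
    __m256i v = _mm256_set1_epi64x(0x0123456789ABCDEFULL);
    // Same count for every lane: the scalar-count form is enough.
    __m256i same = _mm256_srli_epi64(v, 8);
    // Per-lane counts: only needed when lanes shift by different amounts.
    __m256i per_lane = _mm256_srlv_epi64(v, _mm256_set_epi64x(24, 16, 8, 0));
    alignas(32) uint64_t a[4], b[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(a), same);
    _mm256_store_si256(reinterpret_cast<__m256i*>(b), per_lane);
    for (int i = 0; i < 4; ++i) {
      printf("lane %d: srli=%016llx srlv=%016llx\n", i,
             (unsigned long long)a[i], (unsigned long long)b[i]);
    }
    return 0;
  }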

Comment on lines 414 to 415
pos_lo = _mm256_mul_epi32(pos_lo, _mm256_set1_epi32(byte_multiplier));
pos_hi = _mm256_mul_epi32(pos_hi, _mm256_set1_epi32(byte_multiplier));
Member


For the record, why are we multiplying in the signed domain rather than unsigned?

Contributor Author


Yeah, we should use an unsigned multiply.

But they are actually the same in this specific case (i.e., both operands are less than 0x80000000 - note that log_blocks_ is strictly less than 32). Even if the result is larger than UINT32_MAX, _mm256_mul_epi32 won't do sign extension.

Anyway, I'll update. Thank you.
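A minimal standalone check of that claim (not PR code; the operand values follow the N = 26 case above): both intrinsics multiply the low 32 bits of each 64-bit lane into a 64-bit product, and they only differ when an operand has its top bit set, which does not happen here.

  #include <immintrin.h>
  #include <cstdint>
  #include <cstdio>

  int main() {
    // Operands below 0x80000000: signed and unsigned 32x32->64 multiplies agree.
    const uint32_t block_id = (1u << 26) - 1;  // largest block id for N = 26
    const uint32_t byte_multiplier = 40;
    __m256i a = _mm256_set1_epi64x(block_id);
    __m256i b = _mm256_set1_epi64x(byte_multiplier);
    __m256i s = _mm256_mul_epi32(a, b);  // signed low-32 x low-32 -> 64
    __m256i u = _mm256_mul_epu32(a, b);  // unsigned low-32 x low-32 -> 64
    alignas(32) int64_t rs[4];
    alignas(32) uint64_t ru[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(rs), s);
    _mm256_store_si256(reinterpret_cast<__m256i*>(ru), u);
    printf("%lld %llu\n", (long long)rs[0],
           (unsigned long long)ru[0]);  // both print 2684354520
    return 0;
  }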

Contributor Author


Done.

@zanmato1984
Contributor Author

@ursabot please benchmark

1 similar comment
@zanmato1984
Contributor Author

@ursabot please benchmark

@ursabot

ursabot commented Jan 7, 2025

Commit 4462ceb already has scheduled benchmark runs.


Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit 4462ceb.

There were 29 benchmark results with an error.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.
