Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-45216: [C++][Compute] Refactor Rank implementation #45217

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jan 9, 2025

Rationale for this change

The Rank implementation currently mixes ties/duplicates detection and rank computation in a single function CreateRankings. This makes it poorly reusable for other Rank-like functions such as the Percentile Rank function proposed in GH-45190.

What changes are included in this PR?

Split duplicates detection into a dedicated function that sets a marker bit in the sort-indices array (it is private to the Rank implementation, so it is safe to mutate it).

The rank computation itself (CreateRankings) becomes simpler and, moreover, it does not need to read the input values: it becomes therefore type-agnostic.

This yields a code size reduction (around 45kB saved on the author's machine):

  • before:
$ size /build/build-release/relwithdebinfo/libarrow.so
   text	   data	    bss	    dec	    hex	filename
26072218	 353832	2567985	28994035	1ba69f3	/build/build-release/relwithdebinfo/libarrow.so
  • after:
$ size /build/build-release/relwithdebinfo/libarrow.so
   text	   data	    bss	    dec	    hex	filename
26028198	 353832	2567985	28950015	1b9bdff	/build/build-release/relwithdebinfo/libarrow.so

Rank benchmark results are mostly neutral, though there are slight improvements on some benchmarks, and slight regressions especially on all-nulls input.

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

No.

@pitrou
Copy link
Member Author

pitrou commented Jan 9, 2025

ASAN/UBSAN failure in arrow-dataset-dataset-writer-test is unrelated.

@pitrou pitrou marked this pull request as ready for review January 9, 2025 16:29
@pitrou pitrou force-pushed the rank-refactor-light branch from 2054aca to b0ee2d5 Compare January 9, 2025 16:38
Copy link
Contributor

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and thanks for the ping. Just some small questions but I think this is moving in the right direction

while (++it < sorted.non_nulls_end) {
T curr_value = value_selector(*it);
if (curr_value == prev_value) {
*it |= kDuplicateMask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this leading bit never used by it in the current implementation, or are we just assuming that to be highly unlikely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, unless someone manages to fit more than 2**63 elements in a Arrow array :)

if constexpr (has_null_like_values<typename ArrayType::TypeClass>()) {
Partitioner partitioner;
if (null_placement == NullPlacement::AtStart) {
auto null_likes_end =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the float types are explicitly looking at nan as a value instead of the mask right? That's something you are looking to align in a follow up? Or its still an open question as to how to handle?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this change is actually a small unrelated refactor. It does not change any functionality, it's just moving some of the helper code to more modern C++17 idioms such as if constexpr :)

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants