-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45216: [C++][Compute] Refactor Rank implementation #45217
base: main
Are you sure you want to change the base?
Conversation
ASAN/UBSAN failure in |
2054aca
to
b0ee2d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and thanks for the ping. Just some small questions but I think this is moving in the right direction
while (++it < sorted.non_nulls_end) { | ||
T curr_value = value_selector(*it); | ||
if (curr_value == prev_value) { | ||
*it |= kDuplicateMask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this leading bit never used by it
in the current implementation, or are we just assuming that to be highly unlikely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not, unless someone manages to fit more than 2**63 elements in a Arrow array :)
if constexpr (has_null_like_values<typename ArrayType::TypeClass>()) { | ||
Partitioner partitioner; | ||
if (null_placement == NullPlacement::AtStart) { | ||
auto null_likes_end = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the float types are explicitly looking at nan as a value instead of the mask right? That's something you are looking to align in a follow up? Or its still an open question as to how to handle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this change is actually a small unrelated refactor. It does not change any functionality, it's just moving some of the helper code to more modern C++17 idioms such as if constexpr
:)
Rationale for this change
The Rank implementation currently mixes ties/duplicates detection and rank computation in a single function
CreateRankings
. This makes it poorly reusable for other Rank-like functions such as the Percentile Rank function proposed in GH-45190.What changes are included in this PR?
Split duplicates detection into a dedicated function that sets a marker bit in the sort-indices array (it is private to the Rank implementation, so it is safe to mutate it).
The rank computation itself (
CreateRankings
) becomes simpler and, moreover, it does not need to read the input values: it becomes therefore type-agnostic.This yields a code size reduction (around 45kB saved on the author's machine):
Rank benchmark results are mostly neutral, though there are slight improvements on some benchmarks, and slight regressions especially on all-nulls input.
Are these changes tested?
Yes, by existing tests.
Are there any user-facing changes?
No.