GH-44393: [C++][Compute] Vector selection functions inverse_permutation and scatter #44394
Hi, @felipecrv @pitrou . Could you help review this? I think this is a necessary building block for special forms (#41094 (comment)). Much appreciated!
@zanmato1984 I'm not doing code reviews with the same frequency lately, but I might take a look at this one during the week.
cpp/src/arrow/compute/api_vector.h
Outdated
///
/// For indices[i] = x, reverse_indices[x] = i. And reverse_indices[x] = null if x does
/// not appear in the input indices. For indices[i] = x where x < 0 or x >= output_length,
/// it is ignored. If multiple indices point to the same value, the last one is used.
I think this explanation is confusing, but we can work on this later.
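To make the quoted doc comment concrete, here is a minimal sketch of the semantics it describes, written against plain `std::vector` rather than Arrow arrays. The function name `ReverseIndicesSketch` is hypothetical, `std::nullopt` stands in for null, and C++17 is assumed. It mirrors only this (outdated) revision of the comment; the thread later indicates that out-of-range indices stopped being allowed, so the "ignored" branch should not be read as the final behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the kernel described in the quoted comment.
// std::nullopt plays the role of a null slot in the output.
std::vector<std::optional<int64_t>> ReverseIndicesSketch(
    const std::vector<int64_t>& indices, int64_t output_length) {
  assert(output_length >= 0);
  std::vector<std::optional<int64_t>> result(
      static_cast<std::size_t>(output_length), std::nullopt);
  for (std::size_t i = 0; i < indices.size(); ++i) {
    const int64_t x = indices[i];
    // Out-of-range indices are skipped, as this revision of the comment says.
    if (x < 0 || x >= output_length) continue;
    // A later occurrence overwrites an earlier one, so the last one wins.
    result[static_cast<std::size_t>(x)] = static_cast<int64_t>(i);
  }
  return result;
}
```

For `indices = {2, 0, 2, 5}` and `output_length = 4`, this yields `[1, null, 2, null]`: position 2 is claimed twice and keeps the later `i = 2`, position 1 and position 3 are never referenced, and the index 5 is out of range.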
Hi @felipecrv , the renaming is done. Would you like to proceed with the review? Thank you.
Hi @pitrou @felipecrv @mapleFU , shall we move on with this? I'll need this for my new (and possibly formal) implementation of special form. Also cc @bkietz . Thanks.
Thank you @zanmato1984 and sorry for the delay in reviewing. You'll find some comments below.
Also, can you update this PR with recent git main and fix any conflicts?
Hi @pitrou , you will see my commits and comments arrive one by one as I address each of your comments. I'll explicitly let you know once they are all done. Before that, you can just ignore the notifications. Thank you for the review, appreciate that!
Hi @pitrou , I think I've addressed all the previous comments. Would you take a second look? Thanks.
Really sorry for the delay @zanmato1984 . Here is another review, but this mostly LGTM.
docs/source/cpp/compute.rst
Outdated
@@ -1929,3 +1929,28 @@ operation to the n-th and (n+abs(p))-th inputs.
  ``Subtract``. The period can be specified in :struct:`PairwiseOptions`.
* \(2) Wraps around the result when overflow is detected.
* \(3) Returns an ``Invalid`` :class:`Status` when overflow is detected.

Swizzle functions
Instead of creating a category for this, perhaps we should just document these functions as part of the "Selections" category? After all, "scatter" is quite similar to "take" except that it takes a reverse mapping of the indices.
Agree.
Do you think I can keep the term "swizzle" in the source code? It's used in source file names, comments, and the RegisterVectorSwizzle function name. I tried to move them into their "selection" counterparts, but those attempts just complicated the code organization pretty badly.
Yes, it's ok to use "swizzle" in the source code IMHO.
DCHECK_EQ(validity_buf_, nullptr);

ARROW_ASSIGN_OR_RAISE(validity_buf_,
                      AllocateEmptyBitmap(output_length_, ctx_->memory_pool()));
Note that AllocateEmptyBitmap will already memset to 0, you may want to allocate an uninitialized buffer instead.
Addressed.
ARROW_ASSIGN_OR_RAISE(validity_buf_,
                      AllocateEmptyBitmap(output_length_, ctx_->memory_pool()));
auto validity = validity_buf_->mutable_data_as<uint8_t>();
std::memset(validity, valid ? 0xff : 0, validity_buf_->size());
Note: if allocating an uninitialized buffer, should use validity_buf_->capacity() to make sure that padding bits are initialized too.
Very useful advice! Addressed.
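The two review notes above amount to: skip the zero-filling allocation, then fill the whole allocation (capacity) rather than just the logical size. A minimal sketch of that idea follows. It deliberately does not use Arrow: `PaddedBitmapSketch`, `AllocateUninitializedBitmapSketch`, and `FillValidity` are hypothetical names, and the 64-byte padding is an assumption made only so that size and capacity differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical stand-in for an uninitialized bitmap buffer (not Arrow's Buffer).
struct PaddedBitmapSketch {
  std::unique_ptr<uint8_t[]> data;
  int64_t size;      // bytes needed for the requested number of bits
  int64_t capacity;  // bytes actually allocated, including padding
};

PaddedBitmapSketch AllocateUninitializedBitmapSketch(int64_t num_bits) {
  assert(num_bits >= 0);
  const int64_t size = (num_bits + 7) / 8;
  // Assumption for illustration: allocations are padded to a multiple of 64 bytes.
  const int64_t capacity = (size + 63) / 64 * 64;
  // new[] without an initializer leaves the bytes indeterminate (no memset here).
  return {std::unique_ptr<uint8_t[]>(new uint8_t[capacity]), size, capacity};
}

void FillValidity(PaddedBitmapSketch& bitmap, bool valid) {
  // Fill over capacity, not size, so that the padding bytes are initialized too.
  std::memset(bitmap.data.get(), valid ? 0xff : 0x00,
              static_cast<std::size_t>(bitmap.capacity));
}
```

With a 10-bit bitmap, `size` is 2 bytes but `capacity` is 64 under the padding assumption; filling only `size` bytes would leave 62 bytes indeterminate.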
static const std::vector<std::shared_ptr<DataType>> kSignedIntegerTypes = {
    int8(), int16(), int32(), int64()};

static const std::vector<std::shared_ptr<DataType>> kBinaryTypes = {
    binary(), utf8(), large_binary(), large_utf8()};
I'd be surprised if we didn't already have functions to return these sets of types. Perhaps SignedIntTypes() and BaseBinaryTypes()?
Indeed. Addressed.
}

TEST(Scatter, Numeric) {
  for (const auto& type : kSignedIntegerTypes) {
Nit: call this value_type for clarity?
(also, why not test all number types: integers + floats?)
Both addressed.
I was worried that our ArrayFromJSON wasn't robust enough (to parse integers into floats), but it actually does the trick!
/// I = indices
/// m = max_index
/// I' = ReplaceWithMask(I, i > m, null)
/// I'' = ReplaceWithMask(I, i < 0, null) + m + 1
Why are the i < 0 and i > m conditions necessary?
No, they are not. Thank you.
They were introduced in the version that used to allow invalid indices (< 0 or > m). After the change that no longer allows these indices and removes those test cases, these complexities are indeed unnecessary.
I'll update.
Addressed.
688b4d0 to 0d44639
@github-actions crossbow submit -g cpp
Revision: 6ac321c Submitted crossbow builds: ursacomputing/crossbow @ actions-9f34dc63f5
CI looks good to me. Is this ready for review again @zanmato1984 ?
Ah, yes please. Thank you!
LGTM, and CI is green, so I will merge this PR. Thanks a lot @zanmato1984 !
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3222e2a. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
For background please see #44393.
When implementing the "scatter" function requested in #44393, I found it would also be useful as a public vector API. After some painful deliberation, I decided to name it "permute". While implementing permute, I found it fairly easy to do so by first computing the "reverse indices" of the positions and then invoking the existing "take"; I think "reverse_indices" itself can also be a useful public vector API. Thus the PR categorizes them as "placement functions".
What changes are included in this PR?
Implement the vector selection APIs inverse_permutation and scatter, where scatter(values, indices) is implemented as take(values, inverse_permutation(indices)).
Are these changes tested?
UT included.
Are there any user-facing changes?
Yes, new public APIs are added, and the documentation is updated.
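The description states that scatter(values, indices) is implemented as take(values, inverse_permutation(indices)). The following is a minimal sketch of that composition on plain `std::vector`, not the Arrow kernels themselves. All names ending in `Sketch` and the `output_length` parameter are hypothetical, `std::nullopt` stands in for null, every index is assumed to be valid and distinct, and C++17 is required.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using OptIndex = std::optional<int64_t>;

// inverse[x] = i whenever indices[i] == x; slots never referenced stay null.
// Assumption: every index lies in [0, output_length).
std::vector<OptIndex> InversePermutationSketch(const std::vector<int64_t>& indices,
                                               int64_t output_length) {
  std::vector<OptIndex> inverse(static_cast<std::size_t>(output_length), std::nullopt);
  for (std::size_t i = 0; i < indices.size(); ++i) {
    const int64_t x = indices[i];
    assert(x >= 0 && x < output_length);
    inverse[static_cast<std::size_t>(x)] = static_cast<int64_t>(i);
  }
  return inverse;
}

// take(values, idx)[j] = values[idx[j]], or null where idx[j] is null.
template <typename T>
std::vector<std::optional<T>> TakeSketch(const std::vector<T>& values,
                                         const std::vector<OptIndex>& idx) {
  std::vector<std::optional<T>> out;
  out.reserve(idx.size());
  for (const OptIndex& j : idx) {
    if (j.has_value()) {
      out.push_back(values[static_cast<std::size_t>(*j)]);
    } else {
      out.push_back(std::nullopt);
    }
  }
  return out;
}

// scatter places values[i] at output position indices[i].
template <typename T>
std::vector<std::optional<T>> ScatterSketch(const std::vector<T>& values,
                                            const std::vector<int64_t>& indices,
                                            int64_t output_length) {
  return TakeSketch(values, InversePermutationSketch(indices, output_length));
}
```

For `values = {"a", "b", "c"}`, `indices = {2, 0, 3}`, and `output_length = 4`, the inverse permutation is `[1, null, 0, 2]` and the scattered result is `["b", null, "a", "c"]`.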