GH-44393: [C++][Compute] Vector selection functions inverse_permutation and scatter #44394

Merged
merged 61 commits into from
Jan 15, 2025

Conversation

zanmato1984
Contributor

@zanmato1984 zanmato1984 commented Oct 13, 2024

Rationale for this change

For background please see #44393.

When implementing the "scatter" function requested in #44393, I found it would also be useful as a public vector API. After some painful deliberation, I decided to name it "permute". While implementing permute, I found it fairly easy to do so by first computing the "reverse indices" of the positions and then invoking the existing "take"; I think "reverse_indices" can itself be a useful public vector API. The PR therefore categorizes them as "placement functions".

What changes are included in this PR?

Implement vector selection API inverse_permutation and scatter, where scatter(values, indices) is implemented as take(values, inverse_permutation(indices)).
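To illustrate the relationship above, here is a minimal std-only sketch, not the Arrow kernels themselves. The names `sketch::InversePermutation`, `sketch::Take`, and `sketch::Scatter` are hypothetical, `std::nullopt` stands in for an Arrow null, and the sketch assumes every index lies in `[0, output_length)` and that a later duplicate index overrides an earlier one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

namespace sketch {

// std::nullopt stands in for an Arrow null (assumption for this sketch).
using Index = std::optional<int64_t>;

// inverse[x] = i whenever indices[i] == x. Positions that no index points at
// stay null. If several entries point at the same position, the last one wins.
std::vector<Index> InversePermutation(const std::vector<int64_t>& indices,
                                      int64_t output_length) {
  std::vector<Index> inverse(static_cast<std::size_t>(output_length), std::nullopt);
  for (std::size_t i = 0; i < indices.size(); ++i) {
    // Assumed precondition: indices are within [0, output_length).
    assert(indices[i] >= 0 && indices[i] < output_length);
    inverse[static_cast<std::size_t>(indices[i])] = static_cast<int64_t>(i);
  }
  return inverse;
}

// Take(values, idx)[k] = values[idx[k]], or null where idx[k] is null.
template <typename T>
std::vector<std::optional<T>> Take(const std::vector<T>& values,
                                   const std::vector<Index>& idx) {
  std::vector<std::optional<T>> out;
  out.reserve(idx.size());
  for (const Index& i : idx) {
    if (i.has_value()) {
      out.push_back(values[static_cast<std::size_t>(*i)]);
    } else {
      out.push_back(std::nullopt);
    }
  }
  return out;
}

// Scatter places values[i] at position indices[i] of the output, expressed
// here as a Take over the inverse permutation.
template <typename T>
std::vector<std::optional<T>> Scatter(const std::vector<T>& values,
                                      const std::vector<int64_t>& indices,
                                      int64_t output_length) {
  assert(values.size() == indices.size());
  return Take(values, InversePermutation(indices, output_length));
}

}  // namespace sketch
```

With values `{10, 20, 30}`, indices `{2, 0, 3}`, and an output length of 4, the inverse permutation is `{1, null, 0, 2}` and the scattered result is `{20, null, 10, 30}`.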

Are these changes tested?

Unit tests included.

Are there any user-facing changes?

Yes, new public APIs are added, and the documentation is updated.

@zanmato1984 zanmato1984 requested a review from felipecrv October 13, 2024 16:33

⚠️ GitHub issue #44393 has been automatically assigned in GitHub to PR creator.

@zanmato1984
Contributor Author

Hi, @felipecrv @pitrou . Could you help to review this? I think this is a necessary building block for special forms (#41094 (comment)). Much appreciated!

@zanmato1984 zanmato1984 changed the title GH-44393: Placement vector functions GH-44393: [C++][Compute] Placement vector functions Oct 13, 2024
@felipecrv
Contributor

@zanmato1984 I'm not doing code reviews with the same frequency lately, but I might take a look at this one during the week.

cpp/src/arrow/compute/api_vector.cc (outdated, resolved)
cpp/src/arrow/compute/api_vector.h (outdated, resolved)
///
/// For indices[i] = x, reverse_indices[x] = i. And reverse_indices[x] = null if x does
/// not appear in the input indices. For indices[i] = x where x < 0 or x >= output_length,
/// it is ignored. If multiple indices point to the same value, the last one is used.
Contributor

I think this explanation is confusing, but we can work on this later.

cpp/src/arrow/compute/api_vector.h (outdated, resolved)
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Oct 31, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Oct 31, 2024
@zanmato1984
Contributor Author

Hi @felipecrv , the renaming is done. Would you like to proceed with the review? Thank you.

@zanmato1984 zanmato1984 changed the title GH-44393: [C++][Compute] Placement vector functions GH-44393: [C++][Compute] Swizzle vector functions Nov 7, 2024
@zanmato1984
Contributor Author

Hi @pitrou @felipecrv @mapleFU , shall we move on with this? I'll need this for my new (and possibly formal) implementation of special form.

Also cc @bkietz .

Thanks.

Member

@pitrou pitrou left a comment

Thank you @zanmato1984 and sorry for the delay in reviewing. You'll find some comments below.

Also, can you update this PR with recent git main and fix any conflicts?

docs/source/cpp/compute.rst (outdated, resolved)
docs/source/cpp/compute.rst (outdated, resolved)
cpp/src/arrow/compute/api_vector.h (outdated, resolved)
cpp/src/arrow/compute/api_vector.h (outdated, resolved)
cpp/src/arrow/compute/api_vector.h (resolved)
cpp/src/arrow/compute/kernels/vector_swizzle.cc (outdated, resolved)
cpp/src/arrow/compute/kernels/vector_swizzle.cc (outdated, resolved)
cpp/src/arrow/compute/kernels/vector_swizzle.cc (outdated, resolved)
cpp/src/arrow/compute/kernels/vector_swizzle.cc (outdated, resolved)
cpp/src/arrow/compute/kernels/vector_swizzle.cc (outdated, resolved)
@zanmato1984
Contributor Author

Hi @pitrou , you will see my commits and comments arrive one by one as I address each of your comments. I'll explicitly let you know once they are all done. Until then, you can just ignore the notifications.

Thank you for the review, appreciate that!

@zanmato1984
Contributor Author

Hi @pitrou , I think I've addressed all the previous comments. Would you take a second look? Thanks.

Member

@pitrou pitrou left a comment

Really sorry for the delay @zanmato1984 . Here is another review, but this mostly LGTM.

@@ -1929,3 +1929,28 @@ operation to the n-th and (n+abs(p))-th inputs.
``Subtract``. The period can be specified in :struct:`PairwiseOptions`.
* \(2) Wraps around the result when overflow is detected.
* \(3) Returns an ``Invalid`` :class:`Status` when overflow is detected.

Swizzle functions
Member

Instead of creating a category for this, perhaps we should just document these functions as part of the "Selections" category? After all, "scatter" is quite similar to "take" except that it takes a reverse mapping of the indices.

Contributor Author

@zanmato1984 zanmato1984 Jan 15, 2025

Agree.

Do you think I can keep the term "swizzle" in the source code? It's used in source file names, comments, and the RegisterVectorSwizzle function name. I tried to move them into their "selection" counterparts, but those attempts just complicated the code organization pretty badly.

Member

Yes, it's ok to use "swizzle" in the source code IMHO.

DCHECK_EQ(validity_buf_, nullptr);

ARROW_ASSIGN_OR_RAISE(validity_buf_,
AllocateEmptyBitmap(output_length_, ctx_->memory_pool()));
Member

@pitrou pitrou Jan 15, 2025

Note that AllocateEmptyBitmap will already memset to 0; you may want to allocate an uninitialized buffer instead.

Contributor Author

Addressed.

ARROW_ASSIGN_OR_RAISE(validity_buf_,
AllocateEmptyBitmap(output_length_, ctx_->memory_pool()));
auto validity = validity_buf_->mutable_data_as<uint8_t>();
std::memset(validity, valid ? 0xff : 0, validity_buf_->size());
Member

Note: if allocating an uninitialized buffer, you should use validity_buf_->capacity() to make sure that padding bits are initialized too.

Contributor Author

Very useful advice! Addressed.
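
The two suggestions in this thread can be sketched without Arrow as follows. `PaddedBitmap`, `AllocateUninitializedBitmap`, and `MakeFilledBitmap` are hypothetical stand-ins rather than Arrow APIs, and the rounding of the allocation up to a multiple of 64 bytes is an assumption of the sketch. The point is only that, once the allocation is no longer zeroed for you, the fill has to cover the whole allocated capacity and not just the logical size.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

namespace sketch {

// Hypothetical stand-in for a padded bitmap buffer.
struct PaddedBitmap {
  int64_t size;      // bytes needed to hold the logical bits
  int64_t capacity;  // bytes actually allocated, including padding
  std::unique_ptr<uint8_t[]> data;
};

// Allocates without initializing the bytes (new[] without a value initializer).
PaddedBitmap AllocateUninitializedBitmap(int64_t num_bits) {
  const int64_t size = (num_bits + 7) / 8;
  const int64_t capacity = (size + 63) / 64 * 64;  // assumed 64-byte padding
  return PaddedBitmap{size, capacity,
                      std::unique_ptr<uint8_t[]>(new uint8_t[capacity])};
}

// Fills the whole allocation, so the padding bytes are initialized as well.
PaddedBitmap MakeFilledBitmap(int64_t num_bits, bool valid) {
  PaddedBitmap bitmap = AllocateUninitializedBitmap(num_bits);
  std::memset(bitmap.data.get(), valid ? 0xff : 0,
              static_cast<std::size_t>(bitmap.capacity));
  return bitmap;
}

}  // namespace sketch
```

Filling only `size` bytes would leave the trailing padding with indeterminate contents, which is the situation the review comment warns about.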

Comment on lines 33 to 37
static const std::vector<std::shared_ptr<DataType>> kSignedIntegerTypes = {
int8(), int16(), int32(), int64()};

static const std::vector<std::shared_ptr<DataType>> kBinaryTypes = {
binary(), utf8(), large_binary(), large_utf8()};
Member

I'd be surprised if we didn't already have functions to return these sets of types. Perhaps SignedIntTypes() and BaseBinaryTypes()?

Contributor Author

Indeed. Addressed.

}

TEST(Scatter, Numeric) {
for (const auto& type : kSignedIntegerTypes) {
Member

@pitrou pitrou Jan 15, 2025

Nit: call this value_type for clarity?

Member

(also, why not test all number types: integers + floats?)

Contributor Author

Both addressed.

I was worried that our ArrayFromJSON wasn't robust enough to parse integers into floats, but it actually does the trick!

/// I = indices
/// m = max_index
/// I' = ReplaceWithMask(I, i > m, null)
/// I'' = ReplaceWithMask(I, i < 0, null) + m + 1
Member

Why are the i < 0 and i > m conditions necessary?

Contributor Author

No, they are not. Thank you.

They were introduced in the version that used to allow invalid indices (< 0 or > m). After the change that no longer allows these indices and removes those test cases, these complexities are indeed unnecessary.

I'll update.

Contributor Author

Addressed.

@zanmato1984 zanmato1984 changed the title GH-44393: [C++][Compute] Swizzle vector functions GH-44393: [C++][Compute] Selection vector functions inverse_permutation and scatter Jan 15, 2025
@zanmato1984 zanmato1984 changed the title GH-44393: [C++][Compute] Selection vector functions inverse_permutation and scatter GH-44393: [C++][Compute] Vector selection functions inverse_permutation and scatter Jan 15, 2025
@pitrou
Member

pitrou commented Jan 15, 2025

@github-actions crossbow submit -g cpp


Revision: 6ac321c

Submitted crossbow builds: ursacomputing/crossbow @ actions-9f34dc63f5

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Member

pitrou commented Jan 15, 2025

CI looks good to me. Is this ready for review again @zanmato1984 ?

@zanmato1984
Contributor Author

> CI looks good to me. Is this ready for review again @zanmato1984 ?

Ah, yes please. Thank you!

Member

@pitrou pitrou left a comment

LGTM, and CI is green, so I will merge this PR. Thanks a lot @zanmato1984 !

@pitrou pitrou merged commit 3222e2a into apache:main Jan 15, 2025
38 of 39 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Jan 15, 2025

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3222e2a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
