GH-44393: [C++][Compute] Vector selection functions `inverse_permutation` and `scatter` #44394

zanmato1984 · 2024-10-13T16:33:15Z

Rationale for this change

For background please see #44393.

While implementing the "scatter" function requested in #44393, I found it useful to expose it as a public vector API as well. After much deliberation, I decided to name it "permute". While implementing permute, I found it fairly easy to do so by first computing the "reverse indices" of the positions and then invoking the existing "take". I think "reverse_indices" can itself be a useful public vector API. The PR therefore categorizes both as "placement functions".

What changes are included in this PR?

Implement the vector selection functions inverse_permutation and scatter, where scatter(values, indices) is implemented as take(values, inverse_permutation(indices)).
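To make the composition above concrete, here is a minimal standard-library sketch, not the Arrow kernels themselves. The names `Indices`, `Values`, `InversePermutation`, `Take` and `Scatter` are hypothetical stand-ins, nullable elements are modeled with `std::optional`, and the sketch assumes every non-null index lies in `[0, output_length)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins for Arrow arrays: a null element is an empty optional.
using Indices = std::vector<std::optional<int64_t>>;
template <typename T>
using Values = std::vector<std::optional<T>>;

// inverse[indices[i]] = i; positions that no index refers to stay null.
inline Indices InversePermutation(const Indices& indices, int64_t output_length) {
  assert(output_length >= 0);
  Indices inverse(static_cast<std::size_t>(output_length));  // all null initially
  for (std::size_t i = 0; i < indices.size(); ++i) {
    if (!indices[i].has_value()) continue;  // a null index places nothing
    const int64_t x = *indices[i];
    assert(x >= 0 && x < output_length);  // assumption: indices are in bounds
    inverse[static_cast<std::size_t>(x)] = static_cast<int64_t>(i);
  }
  return inverse;
}

// out[j] = values[indices[j]], or null when indices[j] is null.
template <typename T>
Values<T> Take(const Values<T>& values, const Indices& indices) {
  Values<T> out;
  out.reserve(indices.size());
  for (const auto& idx : indices) {
    if (idx.has_value()) {
      out.push_back(values[static_cast<std::size_t>(*idx)]);
    } else {
      out.push_back(std::nullopt);
    }
  }
  return out;
}

// scatter(values, indices) expressed as take(values, inverse_permutation(indices)).
template <typename T>
Values<T> Scatter(const Values<T>& values, const Indices& indices,
                  int64_t output_length) {
  return Take(values, InversePermutation(indices, output_length));
}
```

With values [10, 20, 30], indices [2, 0, 3] and an output length of 4, this sketch places 10 at position 2, 20 at position 0 and 30 at position 3, leaving position 1 null.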

Are these changes tested?

Unit tests included.

Are there any user-facing changes?

Yes, new public APIs are added and the documentation is updated.

GitHub Issue: [C++][Compute] "Scatter" vector functions #44393

github-actions · 2024-10-13T16:33:45Z

⚠️ GitHub issue #44393 has been automatically assigned in GitHub to PR creator.

zanmato1984 · 2024-10-13T16:35:38Z

Hi, @felipecrv @pitrou . Could you help review this? I think this is a necessary building block for special forms (#41094 (comment)). Much appreciated!

felipecrv · 2024-10-21T22:29:26Z

@zanmato1984 I'm not doing code reviews with the same frequency lately, but I might take a look at this one during the week.

felipecrv · 2024-10-30T23:47:24Z

cpp/src/arrow/compute/api_vector.h

+///
+/// For indices[i] = x, reverse_indices[x] = i. And reverse_indices[x] = null if x does
+/// not appear in the input indices. For indices[i] = x where x < 0 or x >= output_length,
+/// it is ignored. If multiple indices point to the same value, the last one is used.


I think this explanation is confusing, but we can work on this later.
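One way to read the quoted doc comment is as the loop below. This is only a sketch of the wording at this revision (out-of-range indices ignored, last duplicate wins), written with the standard library; `ReverseIndices` and `NullableIndex` are hypothetical names, not the Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical nullable index: an empty optional stands for null.
using NullableIndex = std::optional<int64_t>;

// For indices[i] = x, reverse[x] = i. Positions never referenced stay null,
// out-of-range x is skipped, and a later i overwrites an earlier one.
inline std::vector<NullableIndex> ReverseIndices(const std::vector<int64_t>& indices,
                                                 int64_t output_length) {
  assert(output_length >= 0);
  std::vector<NullableIndex> reverse(static_cast<std::size_t>(output_length));
  for (std::size_t i = 0; i < indices.size(); ++i) {
    const int64_t x = indices[i];
    if (x < 0 || x >= output_length) continue;  // ignored, per the quoted comment
    reverse[static_cast<std::size_t>(x)] = static_cast<int64_t>(i);  // last one wins
  }
  return reverse;
}
```

For indices [1, 5, 1, -1, 0] and an output length of 3, the sketch yields [4, 2, null]: 5 and -1 are ignored, and the second occurrence of 1 replaces the first.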

zanmato1984 · 2024-11-07T05:29:33Z

Hi @felipecrv , the renaming is done. Would you like to proceed with the review? Thank you.

zanmato1984 · 2024-12-09T12:31:53Z

Hi @pitrou @felipecrv @mapleFU , shall we move on with this? I'll need this for my new (and possibly formal) implementation of special form.

Also cc @bkietz .

Thanks.

pitrou

Thank you @zanmato1984 and sorry for the delay in reviewing. You'll find some comments below.

Also, can you update this PR with recent git main and fix any conflicts?

zanmato1984 · 2024-12-11T10:34:20Z

Hi @pitrou , you will see my commits and comments arrive one by one as I address each of your comments. I'll explicitly let you know once they are all done. Until then, you can just ignore the notifications.

Thank you for the review, appreciate that!

zanmato1984 · 2024-12-12T10:35:37Z

Hi @pitrou , I think I've addressed all the previous comments. Would you take a second look? Thanks.

pitrou

Really sorry for the delay @zanmato1984 . Here is another review, but this mostly LGTM.

pitrou · 2025-01-15T09:09:57Z

docs/source/cpp/compute.rst

@@ -1929,3 +1929,28 @@ operation to the n-th and (n+abs(p))-th inputs.
  ``Subtract``. The period can be specified in :struct:`PairwiseOptions`.
 * \(2) Wraps around the result when overflow is detected.
 * \(3) Returns an ``Invalid`` :class:`Status` when overflow is detected.
+
+Swizzle functions


Instead of creating a category for this, perhaps we should just document these functions as part of the "Selections" category? After all, "scatter" is quite similar to "take" except that it takes a reverse mapping of the indices.

Agree.

Do you think I can keep the term "swizzle" in the source code? It's used in source file names, comments, and the RegisterVectorSwizzle function name. I tried to move them into their "selection" counterparts, but those attempts just complicated the code organization pretty badly.

Yes, it's ok to use "swizzle" in the source code IMHO.

pitrou · 2025-01-15T09:15:26Z

cpp/src/arrow/compute/kernels/vector_swizzle.cc

+    DCHECK_EQ(validity_buf_, nullptr);
+
+    ARROW_ASSIGN_OR_RAISE(validity_buf_,
+                          AllocateEmptyBitmap(output_length_, ctx_->memory_pool()));


Note that AllocateEmptyBitmap will already memset to 0, you may want to allocate an uninitialized buffer instead.

pitrou · 2025-01-15T09:15:53Z

cpp/src/arrow/compute/kernels/vector_swizzle.cc

+    ARROW_ASSIGN_OR_RAISE(validity_buf_,
+                          AllocateEmptyBitmap(output_length_, ctx_->memory_pool()));
+    auto validity = validity_buf_->mutable_data_as<uint8_t>();
+    std::memset(validity, valid ? 0xff : 0, validity_buf_->size());


Note: if allocating an uninitialized buffer, should use validity_buf_->capacity() to make sure that padding bits are initialized too.

Very useful advice! Addressed.
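The point about padding can be illustrated without Arrow. The sketch below uses a hypothetical `RawBuffer` with separate `size` and `capacity` fields and assumes capacity is rounded up to a multiple of 64 bytes; it only shows why filling `capacity` bytes, rather than `size` bytes, leaves no uninitialized padding behind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical stand-in for a buffer: `size` is the logical byte count,
// `capacity` the allocated byte count (assumed padded to 64 bytes here).
struct RawBuffer {
  std::unique_ptr<uint8_t[]> data;
  int64_t size;
  int64_t capacity;
};

// Allocates room for `num_bits` validity bits without zeroing the memory.
inline RawBuffer AllocateUninitializedBitmap(int64_t num_bits) {
  assert(num_bits >= 0);
  const int64_t size = (num_bits + 7) / 8;         // bytes needed for the bits
  const int64_t capacity = (size + 63) / 64 * 64;  // assumed padding rule
  // `new uint8_t[n]` (no parentheses) leaves the bytes uninitialized.
  std::unique_ptr<uint8_t[]> data(new uint8_t[static_cast<std::size_t>(capacity)]);
  return RawBuffer{std::move(data), size, capacity};
}

// Sets every bit to `valid`, including the padding bytes past `size`.
inline void FillValidity(RawBuffer& buf, bool valid) {
  std::memset(buf.data.get(), valid ? 0xff : 0x00,
              static_cast<std::size_t>(buf.capacity));
}
```

Filling only `size` bytes would set the 10 logical bits of a 10-bit bitmap correctly but leave the remaining 62 bytes of the allocation undefined, which is the case the review comment warns about.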

pitrou · 2025-01-15T09:27:19Z

cpp/src/arrow/compute/kernels/vector_swizzle_test.cc

+static const std::vector<std::shared_ptr<DataType>> kSignedIntegerTypes = {
+    int8(), int16(), int32(), int64()};
+
+static const std::vector<std::shared_ptr<DataType>> kBinaryTypes = {
+    binary(), utf8(), large_binary(), large_utf8()};


I'd be surprised if we didn't already have functions to return these sets of types. Perhaps SignedIntTypes() and BaseBinaryTypes()?

Indeed. Addressed.

pitrou · 2025-01-15T09:35:22Z

cpp/src/arrow/compute/kernels/vector_swizzle_test.cc

+}
+
+TEST(Scatter, Numeric) {
+  for (const auto& type : kSignedIntegerTypes) {


Nit: call this value_type for clarity?

(also, why not test all number types: integers + floats?)

Both addressed.

I was worried that our ArrayFromJSON wasn't robust enough (to parse integers into floats). But it actually does the trick!

pitrou · 2025-01-15T09:41:55Z

cpp/src/arrow/compute/kernels/vector_swizzle_test.cc

+///   I = indices
+///   m = max_index
+///   I' = ReplaceWithMask(I, i > m, null)
+///   I'' = ReplaceWithMask(I, i < 0, null) + m + 1


Why are the i < 0 and i > m conditions necessary?

No, they are not. Thank you.

They were introduced in the version that used to allow invalid indices (< 0 or > m). After the change that no longer allows these indices and removes those test cases, these complexities are indeed unnecessary.

I'll update.

pitrou · 2025-01-15T15:26:37Z

@github-actions crossbow submit -g cpp

github-actions · 2025-01-15T15:29:13Z

Revision: 6ac321c

Submitted crossbow builds: ursacomputing/crossbow @ actions-9f34dc63f5

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

pitrou · 2025-01-15T16:04:40Z

CI looks good to me. Is this ready for review again @zanmato1984 ?

zanmato1984 · 2025-01-15T16:07:11Z

> CI looks good to me. Is this ready for review again @zanmato1984 ?

Ah, yes please. Thank you!

pitrou

LGTM, and CI is green, so I will merge this PR. Thanks a lot @zanmato1984 !

conbench-apache-arrow · 2025-01-17T10:08:06Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 3222e2a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 requested a review from felipecrv October 13, 2024 16:33

github-actions bot added the Component: C++, Component: Documentation, and awaiting review labels Oct 13, 2024

zanmato1984 changed the title ~~GH-44393: Placement vector functions~~ GH-44393: [C++][Compute] Placement vector functions Oct 13, 2024

felipecrv reviewed Oct 31, 2024

github-actions bot added the awaiting changes label and removed the awaiting review label Oct 31, 2024

zanmato1984 commented Oct 31, 2024

cpp/src/arrow/compute/kernels/vector_placement_test.cc (outdated, resolved)

github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Oct 31, 2024

zanmato1984 changed the title ~~GH-44393: [C++][Compute] Placement vector functions~~ GH-44393: [C++][Compute] Swizzle vector functions Nov 7, 2024

pitrou reviewed Nov 8, 2024

cpp/src/arrow/compute/api_vector.h (outdated, resolved)

mapleFU reviewed Nov 8, 2024

cpp/src/arrow/compute/kernels/codegen_internal.h (resolved)

cpp/src/arrow/compute/api_vector.h (resolved)

pitrou requested changes Dec 10, 2024

pitrou requested changes Jan 15, 2025

zanmato1984 added 3 commits January 15, 2025 19:35

WIP

8de08ae

WIP

d36330e

Add permute function options

e68a6d4

zanmato1984 added 11 commits January 15, 2025 19:36

Show no mercy to index out of bounds

fa1d9f2

Use type error instead of invalid

4527c47

Remove errornous predict false

5419a85

Avoid uninitialized data buf

a3bd7c3

Coding convention of instantce variables

962749b

Optimize buffer initializing

ff71d77

Reduce typed tests

6805784

Naming

e30f33c

Remove repetition of test cases

28bfddc

Doc about output length

75d96d6

Fix ci error

0d44639

zanmato1984 force-pushed the vector-placement branch from 688b4d0 to 0d44639 Compare January 15, 2025 11:37

zanmato1984 added 5 commits January 15, 2025 19:49

Move new functions into selection category in doc

953d3b1

Allocate uninitialized buffer and fill the capacity bytes

b82ad94

type -> value_type

eaa9a3c

Simplify chunked cases

b1f1208

Use common type lists and test for more numeric types

4342033

zanmato1984 changed the title ~~GH-44393: [C++][Compute] Swizzle vector functions~~ GH-44393: [C++][Compute] Selection vector functions inverse_permutation and scatter Jan 15, 2025

zanmato1984 changed the title ~~GH-44393: [C++][Compute] Selection vector functions inverse_permutation and scatter~~ GH-44393: [C++][Compute] Vector selection functions inverse_permutation and scatter Jan 15, 2025

zanmato1984 added 2 commits January 15, 2025 23:06

Bump since version

8e4ecab

Refine function docs

6ac321c

pitrou approved these changes Jan 15, 2025

pitrou merged commit 3222e2a into apache:main Jan 15, 2025
38 of 39 checks passed

pitrou removed the awaiting change review label Jan 15, 2025

pitrou mentioned this pull request Jan 15, 2025

[C++][Compute] "Scatter" vector functions #44393

Closed
