GH-41873: [Acero][C++] Reduce asof-join overhead by minimizing copies of the left hand side #41874
Conversation
Force-pushed 361756f to f105892
cc'ing very helpful reviewers from the past: @westonpace @icexelloss @bkietz
size_t start = -1;
size_t end = -1;

for (const auto& slice : slices) {
If it's not self-evident, the asof-join works by creating a CompositeEntry for each output row. Since these so-called "contiguous inputs" are Slice-able, we squash these entries down as a preprocessing step. For example, suppose slices has entries for an LHS table that look like this:

{rb_addr: 1234, start: 1, end: 2},
{rb_addr: 1234, start: 2, end: 3},
{rb_addr: 1234, start: 3, end: 4},
...
{rb_addr: 1234, start: 1000, end: 1001},
{rb_addr: 4321, start: 100001, end: 100002},
{rb_addr: 4321, start: 100002, end: 100003},
...
{rb_addr: 4321, start: 123455, end: 123456},

It would be silly to derive slices from this potentially long vector for every column we mean to output. Thus, this function squashes it down to a very compact vector:

{rb_addr: 1234, start: 1, end: 1001},
{rb_addr: 4321, start: 100001, end: 123456},

which we can quickly use to slice the appropriate output column(s).
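For illustration, here is a minimal standalone sketch of that squashing pass. The struct and function names below are illustrative stand-ins, not Acero's actual CompositeEntry or the PR's code:

#include <cstdint>
#include <vector>

// Illustrative stand-in for an entry: which record batch a run of rows comes
// from, plus its [start, end) row range.
struct Entry {
  uint64_t rb_addr;  // identity/address of the source record batch
  uint64_t start;
  uint64_t end;
};

// Merge adjacent entries that point at the same record batch and whose row
// ranges are back-to-back, so each output column is sliced once per run
// instead of once per row.
std::vector<Entry> SquashContiguous(const std::vector<Entry>& slices) {
  std::vector<Entry> out;
  for (const Entry& e : slices) {
    if (!out.empty() && out.back().rb_addr == e.rb_addr && out.back().end == e.start) {
      out.back().end = e.end;  // extend the current run
    } else {
      out.push_back(e);  // start a new run
    }
  }
  return out;
}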
contiguous_blocks = std::unordered_map<int, std::vector<CompositeEntry>>();
contiguous_blocks.value().reserve(contiguous_srcs.size());
for (int src_table : contiguous_srcs) {
  ARROW_ASSIGN_OR_RAISE(auto flattened_blocks, FlattenSlices(src_table));
Note that we do this outside of materializeColumn. Say the LHS of the asof join has 1000 columns that we want to output: we don't want to flatten the LHS slices once for each of those 1000 columns - instead, we flatten once per contiguous table.
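A self-contained sketch of that "flatten once, reuse per column" structure; the names here (Block, FlattenSlices) are purely illustrative stand-ins, not the PR's actual types:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Block {
  uint64_t rb_addr;
  uint64_t start;
  uint64_t end;
};

// Stand-in for the flattening/squashing pass; its cost is O(#slices) per table.
std::vector<Block> FlattenSlices(int /*src_table*/) { return {}; }

int main() {
  const std::vector<int> contiguous_srcs = {0};  // e.g. only the LHS table
  const int num_output_columns = 1000;

  // Pay the flattening cost once per contiguous source table...
  std::unordered_map<int, std::vector<Block>> contiguous_blocks;
  contiguous_blocks.reserve(contiguous_srcs.size());
  for (int src : contiguous_srcs) {
    contiguous_blocks[src] = FlattenSlices(src);
  }

  // ...then every output column only reads the precomputed blocks when it is
  // materialized, instead of re-flattening the slice list per column.
  for (int col = 0; col < num_output_columns; ++col) {
    const std::vector<Block>& blocks = contiguous_blocks.at(0);
    (void)blocks;  // slice column `col` according to `blocks` here
  }
  return 0;
}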
Force-pushed 6b19840 to 51bc610
// we can just take zero-copy slices of it. This is much faster than how other
// tables are treated, wherein we need to copy
return CompositeTable{schema, inputs.size(), dst_to_src, pool,
                      /*contiguous_sources=*/{0}};
Suggested change:
-                      /*contiguous_sources=*/{0}};
+                      /*contiguous_srcs_=*/{0}};
Some nits.
  DCHECK_EQ(chunk->type()->id(), type->id());
  col.push_back(std::move(chunk));
}
return arrow::Concatenate(col);
Suggested change:
- return arrow::Concatenate(col);
+ return arrow::Concatenate(col, pool);
We need to pass the specified this->pool down to the Concatenate call.
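For context, a sketch of what the suggested change does, assuming col is a vector of same-typed arrays and pool stands in for this->pool:

#include <memory>
#include <vector>

#include "arrow/array.h"
#include "arrow/array/concatenate.h"
#include "arrow/memory_pool.h"
#include "arrow/result.h"

arrow::Result<std::shared_ptr<arrow::Array>> ConcatChunks(
    const std::vector<std::shared_ptr<arrow::Array>>& col, arrow::MemoryPool* pool) {
  // Concatenate allocates the merged buffers from the pool it is given;
  // omitting the argument falls back to arrow::default_memory_pool().
  return arrow::Concatenate(col, pool);
}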
I think I understand the idea and the overall logic looks correct to me (I'll need some more time to look into more details). Though the existing tests should cover this well, could you add dedicated UTs for the change (e.g., for UnmaterializedCompositeTable operations with a meaningful contiguous_srcs)? I think that helps in terms of both quality and readability. Thanks.
  DCHECK_EQ(chunk->type()->id(), type->id());
  col.push_back(std::move(chunk));
}
return arrow::Concatenate(col);
Concatenate internally copies the chunks, which has its own overhead as well, though it should still outperform the row-by-row copying, esp. with the flattened slices. Just mentioning this to make sure we don't expect too much from this improvement :)
Hi @JerAguilon, I put some comments on this PR a while ago. I wonder if you are still willing to move forward with it? Or is there anything I can do to help? Thanks.
Rationale for this change
This PR is a 30-65% performance optimization on the asof-join benchmarks; it is purely a performance change, so there are no visible behavioral/test changes.

Please read #41873, where I explain exactly why this optimization works. The idea is that for the left hand side of the join, rather than copying data to the output arrays cell-by-cell, we can take Array::Slices, which are zero-copy and have minimal overhead. This results in large speedups that scale with the number of LHS columns we are emitting.
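To illustrate the zero-copy claim, a small standalone example (not taken from the PR) showing that Array::Slice only records a new offset/length over the parent's buffers:

#include <memory>

#include "arrow/api.h"

arrow::Result<std::shared_ptr<arrow::Array>> SliceExample() {
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4, 5}));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> arr, builder.Finish());

  // Zero-copy: the slice shares the parent's buffers and only stores a new
  // offset/length, so its cost is independent of the cell type or slice size.
  std::shared_ptr<arrow::Array> middle = arr->Slice(/*offset=*/1, /*length=*/3);
  return middle;
}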
What changes are included in this PR?
Note: To reduce merge conflict headaches, I had rebased this PR on top of #41125 since I was aware it had just been accepted. (This is now rebased on origin/main since that PR was merged.)

Aside from the changes in the parent PR, the changes are mostly localized to unmaterialized_table.h. We add a new field, contiguous_srcs, which contains the set of table IDs that can simply be Sliced. asof_join_node.cc then initializes an UnmaterializedCompositeTable with this new field (see the snippet below), which indicates that table ID 0 (i.e., the left hand side) can be quickly sliced.
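The initialization in question, as quoted in the review comments above:

// In asof_join_node.cc: table ID 0 (the left hand side) is marked as a
// contiguous source, so its output columns can be zero-copy sliced rather
// than copied row by row.
return CompositeTable{schema, inputs.size(), dst_to_src, pool,
                      /*contiguous_sources=*/{0}};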
Are these changes tested?
Yes - here are some results from running arrow-acero-asof-join-benchmark: https://gist.github.com/JerAguilon/68568525f3818f60dc2ffcfe5eb6aba2

This was run on a 32GB Apple M1 MacBook Pro.
Generally, we see a 30-65% improvement in rows/sec with no discernible changes in peak memory. The scale of improvement depends on the number of columns on the LHS.
Anecdotally, there are peak memory improvements at larger scales than the benchmarks cover. I've personally asof-joined 50GB+ Parquet files. At that size, you can accumulate a large backlog of work on the producer thread; if you emit rows faster, the backlog can be kept lower. Furthermore, the benefit is even larger if the left hand table has variable-length data types like strings or arrays - copying these cells is very expensive!
Are there any user-facing changes?