[EXP][C++] Deduplicate schemas when scanning Dataset #45340

pitrou · 2025-01-23T17:00:54Z

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

pitrou · 2025-01-23T17:01:06Z

@ursabot please benchmark

ursabot · 2025-01-23T17:01:12Z

Benchmark runs are scheduled for commit f681035. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2025-01-23T17:24:00Z

@ursabot please benchmark

ursabot · 2025-01-23T17:24:06Z

Benchmark runs are scheduled for commit a0503f3. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2025-01-23T17:25:43Z

@icexelloss This is a quick experiment that you might want to try out on a real-world use case. I don't seem to get any tangible benefits on a synthetic dataset, though it might be due to memory fragmentation.

conbench-apache-arrow · 2025-01-23T17:32:39Z

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit f681035.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

conbench-apache-arrow · 2025-01-23T23:53:58Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a0503f3.

There were 775 benchmark results indicating a performance regression:

Pull Request Run on test-mac-arm at 2025-01-23 18:46:18Z
- TakeChunkedChunkedStringFewMonotonicIndices (C++) with params=4194304/1, source=cpp-micro, suite=arrow-compute-vector-selection-benchmark
- ListSliceStringListViewWithStop (C++) with params=65536/0, source=cpp-micro, suite=arrow-compute-scalar-list-benchmark
and 773 more (see the report linked below)

The full Conbench report has more details.

pitrou · 2025-01-24T13:48:26Z

Ok, there are so many unrelated "regressions" in the report above that I'm going to launch another benchmarking run, as it's likely that external factors have influenced that run.

pitrou · 2025-01-24T13:48:33Z

@ursabot please benchmark

ursabot · 2025-01-24T13:48:36Z

Commit a0503f3 already has scheduled benchmark runs.

pitrou · 2025-01-24T13:49:15Z

@ursabot please benchmark

ursabot · 2025-01-24T13:49:20Z

Benchmark runs are scheduled for commit 335a46b. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

conbench-apache-arrow · 2025-01-24T20:23:36Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 335a46b.

There were 23 benchmark results indicating a performance regression:

Pull Request Run on test-mac-arm at 2025-01-24 15:13:21Z
- BenchmarkTemporal (C++) with params=<Subsecond, non_zoned>/4194304/0, source=cpp-micro, suite=arrow-compute-scalar-temporal-benchmark
- FilterOverhead (C++) with params=selectivity_benchmark/batch_size:100000/null_prob:100/bool_true_prob:50/real_time, source=cpp-micro, suite=arrow-acero-filter-benchmark
and 21 more (see the report linked below)

The full Conbench report has more details.

github-actions bot added the Component: Parquet, Component: C++, and awaiting review labels on Jan 23, 2025

pitrou force-pushed the exp_deduplicate_schema branch 2 times, most recently from 9610573 to a0503f3, on January 23, 2025 17:23

[EXP][C++] Deduplicate schemas when scanning Dataset

335a46b

pitrou force-pushed the exp_deduplicate_schema branch from a0503f3 to 335a46b on January 24, 2025 13:49

