Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany #43954

Merged
merged 15 commits into from
Sep 24, 2024

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Sep 4, 2024

Rationale for this change

Improve tests and add benchmarks. I wrote the tests and benchmarks while trying to improve the performance of ResolveMany and failing at it.

What changes are included in this PR?

Tests, benchmarks, and changes that don't really affect performance but might unlock more optimization opportunities in the future.

Are these changes tested?

Yes.

@felipecrv felipecrv changed the title [C++] GH-43953: [C++] Add tests based on random data and benchmarks to ChunkResolver::ResolveMany Sep 4, 2024
@apache apache deleted a comment from github-actions bot Sep 4, 2024
@felipecrv felipecrv requested a review from pitrou September 4, 2024 15:05
Copy link

github-actions bot commented Sep 4, 2024

⚠️ GitHub issue #43953 has been automatically assigned in GitHub to PR creator.

@felipecrv
Copy link
Contributor Author

The chunk_hint stuff really works. I forgot to plot it, but without it the throughput gains don't exist when indices are sorted.

resolvemany

@pitrou
Copy link
Member

pitrou commented Sep 4, 2024

Results here (AMD Zen 2 CPU, gcc):

ResolveManyUInt16Random/65535/10000/508         7072 ns         7071 ns        98917 items_per_second=71.8471M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Random/65535/100/508           4098 ns         4097 ns       170804 items_per_second=123.986M/s logical_len=65.535k num_chunks=100
ResolveManyUInt16Random/65535/10000/32764    1812677 ns      1812178 ns          393 items_per_second=18.0799M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Random/65535/100/32764       723697 ns       723335 ns          876 items_per_second=45.2958M/s logical_len=65.535k num_chunks=100

ResolveManyUInt32Random/65535/10000/508         7796 ns         7795 ns        88954 items_per_second=65.1731M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Random/65535/100/508           4397 ns         4396 ns       157489 items_per_second=115.566M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Random/65535/10000/32764    1848926 ns      1848181 ns          379 items_per_second=17.7277M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Random/65535/100/32764       752273 ns       752045 ns          917 items_per_second=43.5665M/s logical_len=65.535k num_chunks=100

ResolveManyUInt64Random/65535/10000/508         6731 ns         6730 ns       103505 items_per_second=75.4878M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Random/65535/100/508           4042 ns         4041 ns       172940 items_per_second=125.717M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Random/65535/10000/32764    1809563 ns      1809068 ns          387 items_per_second=18.111M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Random/65535/100/32764       731165 ns       730968 ns          953 items_per_second=44.8227M/s logical_len=65.535k num_chunks=100

ResolveManyUInt16Sorted/65535/10000/508         6233 ns         6231 ns       111713 items_per_second=81.5236M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Sorted/65535/100/508            955 ns          955 ns       728588 items_per_second=532.031M/s logical_len=65.535k num_chunks=100
ResolveManyUInt16Sorted/65535/10000/32764     314016 ns       313931 ns         2221 items_per_second=104.367M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt16Sorted/65535/100/32764        41365 ns        41356 ns        17068 items_per_second=792.247M/s logical_len=65.535k num_chunks=100

ResolveManyUInt32Sorted/65535/10000/508         7176 ns         7174 ns        91879 items_per_second=70.8084M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Sorted/65535/100/508            928 ns          928 ns       749739 items_per_second=547.568M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Sorted/65535/10000/32764     326595 ns       326507 ns         2145 items_per_second=100.347M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt32Sorted/65535/100/32764        34162 ns        34154 ns        20518 items_per_second=959.309M/s logical_len=65.535k num_chunks=100

ResolveManyUInt64Sorted/65535/10000/508         6057 ns         6055 ns       113029 items_per_second=83.8968M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Sorted/65535/100/508            994 ns          993 ns       700153 items_per_second=511.41M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Sorted/65535/10000/32764     316107 ns       316023 ns         2212 items_per_second=103.676M/s logical_len=65.535k num_chunks=10k
ResolveManyUInt64Sorted/65535/100/32764        38441 ns        38427 ns        20599 items_per_second=852.64M/s logical_len=65.535k num_chunks=100

@pitrou
Copy link
Member

pitrou commented Sep 4, 2024

Ironically, uint32 seems slightly slower than both uint16 and uint64. Not sure that's due to the compiler or to the CPU.


template <typename IndexType>
void ResolveManySetArgs(benchmark::internal::Benchmark* bench) {
constexpr int32_t kNonAligned = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what this is for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optimizations I was experimenting with involved some unrolling so I didn't want input values to neatly align to powers of 2.

case 2:
case 4:
case 8:
bench->Args({kChunkedArrayLength, /*num_chunks*/ 10000, kNumIndicesFew});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10000 chunks is really a lot and I'm not sure it's really useful to test different numbers of chunks. By accumulating different combinations of parameters we make the benchmark results less immediately readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The huge number is necessary (even though it's a bit unrealistic) to measure how effective the binary search is at reducing the search space (proportional to the number of chunks).

@@ -55,7 +55,7 @@ ChunkedArray::ChunkedArray(ArrayVector chunks, std::shared_ptr<DataType> type)
<< "cannot construct ChunkedArray from empty vector and omitted type";
type_ = chunks_[0]->type();
}

ARROW_CHECK_LE(chunks.size(), std::numeric_limits<int>::max());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this limit useful if it ends up not making performance better anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useful in ensuring we are not generating more chunks than can be addressed by the index type.

cpp/src/arrow/chunked_array_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Sep 4, 2024
@felipecrv
Copy link
Contributor Author

felipecrv commented Sep 4, 2024

Ironically, uint32 seems slightly slower than both uint16 and uint64. Not sure that's due to the compiler or to the CPU.

Same on 12th Gen Intel(R) Core(TM) i7-12800H. This is the reason I changed lo and hi to uint32_t and nothing changed in the benchmarks.

Things are saner on the Apple M1 Pro with 32-bit being the fastest.

Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 14.23, 6.75, 4.28
------------------------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------
ResolveManyUInt16Random/65535/100/508       2769 ns         2747 ns       252543 items_per_second=184.959M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Random/65535/100/508       2715 ns         2689 ns       257431 items_per_second=188.93M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Random/65535/100/508       2928 ns         2757 ns       252188 items_per_second=184.281M/s logical_len=65.535k num_chunks=100
ResolveManyUInt16Sorted/65535/100/508       1023 ns         1012 ns       693694 items_per_second=501.751M/s logical_len=65.535k num_chunks=100
ResolveManyUInt32Sorted/65535/100/508        929 ns          920 ns       761052 items_per_second=551.881M/s logical_len=65.535k num_chunks=100
ResolveManyUInt64Sorted/65535/100/508        933 ns          926 ns       759878 items_per_second=548.86M/s logical_len=65.535k num_chunks=100

NOTE: I have a lot of stuff running on the M1 during this benchmark :)

@pitrou
Copy link
Member

pitrou commented Sep 4, 2024

There are a number of CI failures that need fixing.

@felipecrv
Copy link
Contributor Author

There are a number of CI failures that need fixing.

TIL uniform_int_distribution can't take (u)int8_t as type parameter on MSVC builds.

@felipecrv felipecrv requested a review from pitrou September 11, 2024 17:27
@pitrou
Copy link
Member

pitrou commented Sep 12, 2024

The chunk_hint stuff really works. I forgot to plot it, but without it the throughput gains don't exist when indices are sorted.

Yes, it's also quite dramatic for the actual Sort implementation :-)

@pitrou
Copy link
Member

pitrou commented Sep 12, 2024

There are still a couple CI failures that need fixing.

@pitrou
Copy link
Member

pitrou commented Sep 17, 2024

There are still a couple compilation errors on CI it seems :-)

@felipecrv
Copy link
Contributor Author

@pitrou now it's all green. The only failure is unrelated.

@pitrou
Copy link
Member

pitrou commented Sep 24, 2024

Just a question: do you mean to keep the 64-bit to 32-bit chunk index change?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after a minor push. I'll merge if CI is green.

@pitrou pitrou merged commit 83f35de into apache:main Sep 24, 2024
36 of 40 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Sep 24, 2024
@felipecrv felipecrv deleted the chunk_resolver_bench branch September 24, 2024 16:59
@felipecrv
Copy link
Contributor Author

Just a question: do you mean to keep the 64-bit to 32-bit chunk index change?

Yes I do.

Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 83f35de.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants