Introduce some simple benchmarks for rolling window aggregations #17613
There's a minor nit, but otherwise, LGTM!
I've re-kicked the CI, because of transient failures.
f1a6a36 to cb2530f
NVBENCH_BENCH_TYPES(bench_row_fixed_rolling_sum,
                    NVBENCH_TYPE_AXES(nvbench::type_list<std::int32_t, double>))
  .set_name("row_fixed_rolling_sum")
  .add_int64_power_of_two_axis("num_rows", {14, 22, 28})
nit: Is there any reason for the choice of these values of num_rows, or are they just arbitrary?
Smallish, mediumish, and largeish. But other than that, not particularly attached to them.
/merge
… rolling windows (#17623)

Having introduced benchmarks of rolling window performance in #17613, we can now look at what happens when cleaning up some of the implementation.

### Breaking change

Previously, the most general "variable-sized" rolling window function applied a transform iterator (sometimes multiple times) to the user-provided `preceding` and `following` columns. We now _require_, and document, that the user provides values that are always in bounds. This simplifies the book-keeping, and allows us to remove some (now redundant) checks from the rolling kernel.

### Performance improvements

In various places, we accept fixed-size integers as row-based offsets for rolling windows. We used to materialise these into columns representing the clamped `preceding` and `following` values. However, we did so inconsistently with respect to which side we would clamp (after the change to allow negative window offsets). To fix this, we rationalise the clamping in one place and just use a transform iterator. We now do this consistently for both grouped and ungrouped fixed-size windows.
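To make the clamping idea concrete, here is a minimal host-side sketch in plain C++. It is not the libcudf implementation: the names `window_bounds`, `clamped_window`, and `rolling_sum` are hypothetical, and it assumes that `preceding` counts the current row while `following` does not. The point it illustrates is that the clamped bounds can be computed per row on the fly (the role a transform iterator plays) instead of being materialised as two extra offset columns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type: half-open row range [begin, end) covered by one window.
struct window_bounds {
  std::int64_t begin;
  std::int64_t end;
};

// Assumption: `preceding` includes the current row and `following` does not, so
// the unclamped window at `row` is [row - preceding + 1, row + following + 1).
inline window_bounds clamped_window(std::int64_t row,
                                    std::int64_t num_rows,
                                    std::int64_t preceding,
                                    std::int64_t following)
{
  assert(row >= 0 && row < num_rows);
  // Clamp both ends into [0, num_rows] in one place. Negative offsets can push
  // the window entirely out of range, in which case it becomes empty.
  auto const begin = std::clamp(row - preceding + 1, std::int64_t{0}, num_rows);
  auto const end   = std::clamp(row + following + 1, std::int64_t{0}, num_rows);
  return {begin, std::max(begin, end)};
}

// Rolling sum over a fixed-size window. Bounds are derived lazily for each row,
// so no clamped `preceding`/`following` columns are allocated.
inline std::vector<std::int64_t> rolling_sum(std::vector<std::int32_t> const& values,
                                             std::int64_t preceding,
                                             std::int64_t following)
{
  auto const num_rows = static_cast<std::int64_t>(values.size());
  std::vector<std::int64_t> result(values.size(), 0);
  for (std::int64_t row = 0; row < num_rows; ++row) {
    auto const [begin, end] = clamped_window(row, num_rows, preceding, following);
    for (std::int64_t i = begin; i < end; ++i) { result[row] += values[i]; }
  }
  return result;
}
```

With `values = {1, 2, 3, 4, 5}`, `preceding = 2`, and `following = 1`, the windows at the first and last rows are truncated to the column bounds. With a negative `preceding` the window can lie wholly past the end of the column, and the sketch treats that as an empty window with sum zero.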
### Benchmark results

Comparing da4accc (the tip of #17613) and this branch:

### nvbench compare output

Headline numbers:

- group-aware fixed rolling windows between 10 and 30% faster, use 30% less memory
- ungrouped fixed rolling windows between 10 and 50% faster, use 50% less memory
- ungrouped variable rolling windows basically no change (so no regression)

<details>
<summary> All the data </summary>

latest/_deps/nvbench-src/scripts/nvbench_compare.py da4accc.json c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json
['da4acccf6b76a93b1e9395b51950a17c45da99c3.json', 'c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json']

# row_grouped_rolling_sum

## [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | cardinality | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----|------------|------------------|------------------|---------------|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 2^14 | 1 | 2 | 1 | 10 | 123.556 us | 3.90% | 111.199 us | 4.09% | -12.357 us | -10.00% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 10 | 24.945 ms | 0.13% | 17.126 ms | 0.26% | -7819.548 us | -31.35% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 10 | 122.737 us | 4.09% | 109.055 us | 4.40% | -13.682 us | -11.15% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 10 | 24.944 ms | 0.18% | 17.151 ms | 0.20% | -7793.093 us | -31.24% | FAIL |
| I32 | 2^14 | 1 | 2 | 1 | 100 | 119.947 us | 4.07% | 108.432 us | 5.18% | -11.515 us | -9.60% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 100 | 24.954 ms | 0.15% | 17.149 ms | 0.20% | -7805.558 us | -31.28% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 100 | 123.919 us | 3.85% | 109.812 us | 4.63% | -14.107 us | -11.38% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 100 | 24.955 ms | 0.14% | 17.173 ms | 0.18% | -7782.547 us | -31.19% | FAIL |
| I32 | 2^14 | 1 | 2 | 1 | 1000000 | 121.896 us | 4.59% | 109.115 us | 4.57% | -12.781 us | -10.49% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 1000000 | 25.291 ms | 0.15% | 17.460 ms | 0.21% | -7830.713 us | -30.96% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 1000000 | 126.450 us | 3.65% | 111.787 us | 3.37% | -14.662 us | -11.60% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 1000000 | 25.295 ms | 0.13% | 17.593 ms | 0.19% | -7701.923 us | -30.45% | FAIL |
| I32 | 2^14 | 1 | 2 | 1 | 100000000 | 122.268 us | 4.42% | 109.338 us | 5.17% | -12.930 us | -10.57% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 100000000 | 28.409 ms | 0.12% | 20.422 ms | 0.16% | -7987.039 us | -28.11% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 100000000 | 127.440 us | 2.96% | 113.009 us | 4.54% | -14.431 us | -11.32% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 100000000 | 28.395 ms | 0.12% | 20.789 ms | 0.17% | -7606.146 us | -26.79% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 10 | 119.900 us | 4.27% | 110.024 us | 5.21% | -9.876 us | -8.24% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 10 | 26.547 ms | 0.14% | 18.790 ms | 0.19% | -7756.825 us | -29.22% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 10 | 127.211 us | 3.62% | 112.724 us | 4.63% | -14.487 us | -11.39% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 10 | 30.087 ms | 0.31% | 23.886 ms | 0.19% | -6201.201 us | -20.61% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 100 | 120.433 us | 2.42% | 109.777 us | 4.76% | -10.656 us | -8.85% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 100 | 26.568 ms | 0.14% | 18.797 ms | 0.18% | -7770.905 us | -29.25% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 100 | 126.702 us | 3.34% | 112.492 us | 4.27% | -14.210 us | -11.22% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 100 | 30.106 ms | 0.24% | 24.001 ms | 0.46% | -6105.382 us | -20.28% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 1000000 | 119.557 us | 4.07% | 110.355 us | 5.38% | -9.203 us | -7.70% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 1000000 | 26.906 ms | 0.15% | 19.126 ms | 0.18% | -7779.215 us | -28.91% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 1000000 | 128.546 us | 3.89% | 113.429 us | 2.36% | -15.117 us | -11.76% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 1000000 | 30.898 ms | 0.18% | 24.737 ms | 0.45% | -6161.068 us | -19.94% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 100000000 | 120.927 us | 4.36% | 109.903 us | 4.67% | -11.024 us | -9.12% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 100000000 | 29.997 ms | 0.12% | 21.968 ms | 0.16% | -8028.453 us | -26.76% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 100000000 | 127.789 us | 3.78% | 112.861 us | 3.81% | -14.928 us | -11.68% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 100000000 | 36.555 ms | 0.15% | 29.882 ms | 0.12% | -6672.173 us | -18.25% | FAIL |

# row_fixed_rolling_sum

## [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----|------------|------------------|------------------|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
| I32 | 2^14 | 1 | 2 | 1 | 31.751 us | 4.87% | 17.603 us | 6.85% | -14.148 us | -44.56% | FAIL |
| I32 | 2^22 | 1 | 2 | 1 | 206.365 us | 1.51% | 95.830 us | 2.39% | -110.534 us | -53.56% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 10.883 ms | 0.20% | 4.767 ms | 0.25% | -6115.576 us | -56.20% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 31.956 us | 5.91% | 17.559 us | 7.14% | -14.397 us | -45.05% | FAIL |
| I32 | 2^22 | 10 | 2 | 1 | 206.654 us | 1.65% | 95.575 us | 2.29% | -111.079 us | -53.75% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 10.872 ms | 0.12% | 4.758 ms | 0.29% | -6114.139 us | -56.24% | FAIL |
| I32 | 2^14 | 100 | 2 | 1 | 40.611 us | 4.58% | 26.636 us | 9.61% | -13.976 us | -34.41% | FAIL |
| I32 | 2^22 | 100 | 2 | 1 | 293.156 us | 1.54% | 235.445 us | 2.23% | -57.710 us | -19.69% | FAIL |
| I32 | 2^28 | 100 | 2 | 1 | 17.109 ms | 0.42% | 14.613 ms | 0.48% | -2496.488 us | -14.59% | FAIL |
| I32 | 2^14 | 1 | 2 | 20 | 32.786 us | 6.05% | 18.115 us | 10.13% | -14.671 us | -44.75% | FAIL |
| I32 | 2^22 | 1 | 2 | 20 | 206.530 us | 1.61% | 96.530 us | 2.79% | -109.999 us | -53.26% | FAIL |
| I32 | 2^28 | 1 | 2 | 20 | 10.883 ms | 0.16% | 4.767 ms | 0.27% | -6115.283 us | -56.19% | FAIL |
| I32 | 2^14 | 10 | 2 | 20 | 31.927 us | 4.21% | 17.784 us | 15.89% | -14.144 us | -44.30% | FAIL |
| I32 | 2^22 | 10 | 2 | 20 | 206.093 us | 1.58% | 95.657 us | 2.92% | -110.436 us | -53.59% | FAIL |
| I32 | 2^28 | 10 | 2 | 20 | 10.872 ms | 0.12% | 4.757 ms | 0.06% | -6115.435 us | -56.25% | FAIL |
| I32 | 2^14 | 100 | 2 | 20 | 40.361 us | 3.87% | 26.359 us | 3.42% | -14.002 us | -34.69% | FAIL |
| I32 | 2^22 | 100 | 2 | 20 | 292.868 us | 1.80% | 235.340 us | 2.69% | -57.528 us | -19.64% | FAIL |
| I32 | 2^28 | 100 | 2 | 20 | 17.075 ms | 0.47% | 14.613 ms | 0.53% | -2462.145 us | -14.42% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 35.335 us | 11.59% | 18.771 us | 13.65% | -16.564 us | -46.88% | FAIL |
| F64 | 2^22 | 1 | 2 | 1 | 230.322 us | 1.42% | 121.412 us | 3.23% | -108.910 us | -47.29% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 12.460 ms | 0.09% | 6.311 ms | 0.11% | -6149.789 us | -49.35% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 39.365 us | 8.03% | 24.712 us | 14.97% | -14.653 us | -37.22% | FAIL |
| F64 | 2^22 | 10 | 2 | 1 | 283.241 us | 1.25% | 224.296 us | 1.86% | -58.945 us | -20.81% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 15.983 ms | 0.13% | 13.039 ms | 0.86% | -2943.223 us | -18.42% | FAIL |
| F64 | 2^14 | 100 | 2 | 1 | 46.514 us | 8.38% | 28.323 us | 10.77% | -18.191 us | -39.11% | FAIL |
| F64 | 2^22 | 100 | 2 | 1 | 1.739 ms | 0.16% | 1.668 ms | 0.53% | -70.890 us | -4.08% | FAIL |
| F64 | 2^28 | 100 | 2 | 1 | 108.509 ms | 0.06% | 104.891 ms | 0.04% | -3618.662 us | -3.33% | FAIL |
| F64 | 2^14 | 1 | 2 | 20 | 31.566 us | 6.77% | 17.406 us | 8.84% | -14.160 us | -44.86% | FAIL |
| F64 | 2^22 | 1 | 2 | 20 | 230.307 us | 1.47% | 121.044 us | 1.90% | -109.263 us | -47.44% | FAIL |
| F64 | 2^28 | 1 | 2 | 20 | 12.462 ms | 0.08% | 6.315 ms | 0.21% | -6147.508 us | -49.33% | FAIL |
| F64 | 2^14 | 10 | 2 | 20 | 39.277 us | 6.21% | 23.033 us | 16.63% | -16.244 us | -41.36% | FAIL |
| F64 | 2^22 | 10 | 2 | 20 | 281.374 us | 1.09% | 219.307 us | 1.12% | -62.067 us | -22.06% | FAIL |
| F64 | 2^28 | 10 | 2 | 20 | 15.944 ms | 0.47% | 12.882 ms | 0.72% | -3062.193 us | -19.21% | FAIL |
| F64 | 2^14 | 100 | 2 | 20 | 46.079 us | 8.03% | 28.287 us | 11.55% | -17.792 us | -38.61% | FAIL |
| F64 | 2^22 | 100 | 2 | 20 | 1.740 ms | 0.19% | 1.664 ms | 0.36% | -75.602 us | -4.35% | FAIL |
| F64 | 2^28 | 100 | 2 | 20 | 108.475 ms | 0.04% | 104.920 ms | 0.04% | -3555.241 us | -3.28% | FAIL |

# row_variable_rolling_sum

## [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-----|------------|------------------|------------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | 2^14 | 10 | 2 | 20.407 us | 17.95% | 17.981 us | 9.19% | -2.427 us | -11.89% | FAIL |
| I32 | 2^22 | 10 | 2 | 145.353 us | 1.96% | 145.458 us | 1.82% | 0.105 us | 0.07% | PASS |
| I32 | 2^28 | 10 | 2 | 7.850 ms | 0.17% | 7.863 ms | 0.21% | 12.240 us | 0.16% | PASS |
| I32 | 2^14 | 100 | 2 | 26.130 us | 6.01% | 26.324 us | 3.86% | 0.194 us | 0.74% | PASS |
| I32 | 2^22 | 100 | 2 | 234.618 us | 2.86% | 245.878 us | 3.30% | 11.260 us | 4.80% | FAIL |
| I32 | 2^28 | 100 | 2 | 14.624 ms | 0.79% | 15.221 ms | 0.68% | 596.983 us | 4.08% | FAIL |
| F64 | 2^14 | 10 | 2 | 26.256 us | 2.86% | 26.508 us | 6.46% | 0.252 us | 0.96% | PASS |
| F64 | 2^22 | 10 | 2 | 223.001 us | 1.96% | 222.131 us | 1.85% | -0.870 us | -0.39% | PASS |
| F64 | 2^28 | 10 | 2 | 12.896 ms | 0.58% | 12.893 ms | 0.78% | -3.356 us | -0.03% | PASS |
| F64 | 2^14 | 100 | 2 | 33.438 us | 6.71% | 34.323 us | 1.75% | 0.885 us | 2.65% | FAIL |
| F64 | 2^22 | 100 | 2 | 1.667 ms | 0.49% | 1.672 ms | 0.26% | 5.138 us | 0.31% | FAIL |
| F64 | 2^28 | 100 | 2 | 104.673 ms | 0.05% | 104.982 ms | 0.04% | 309.519 us | 0.30% | FAIL |

# Summary

- Total Matches: 80
- Pass (diff <= min_noise): 6
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 74

</details>

<details>
<summary> Before </summary>

# Benchmark Results

## row_grouped_rolling_sum

### [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | cardinality | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|------------------|----------------|----------------|-------------|-------------|---------|------------|-------|------------|-------|---------|-------------------|
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 10 | 840x | 127.585 us | 5.09% | 123.556 us | 3.90% | 132 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 10 | 601x | 24.948 ms | 0.13% | 24.945 ms | 0.13% | 10761 | 7.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 10 | 1020x | 126.638 us | 5.18% | 122.737 us | 4.09% | 133 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 10 | 601x | 24.947 ms | 0.18% | 24.944 ms | 0.18% | 10761 | 7.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 100 | 630x | 123.913 us | 5.30% | 119.947 us | 4.07% | 136 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 100 | 601x | 24.958 ms | 0.15% | 24.954 ms | 0.15% | 10757 | 7.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 100 | 984x | 127.839 us | 4.98% | 123.919 us | 3.85% | 132 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 100 | 601x | 24.958 ms | 0.14% | 24.955 ms | 0.14% | 10756 | 7.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 1000000 | 802x | 125.851 us | 5.65% | 121.896 us | 4.59% | 134 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 1000000 | 593x | 25.294 ms | 0.15% | 25.291 ms | 0.15% | 10613 | 7.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 1000000 | 992x | 130.416 us | 4.81% | 126.450 us | 3.65% | 129 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 1000000 | 593x | 25.298 ms | 0.14% | 25.295 ms | 0.13% | 10612 | 7.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 100000000 | 820x | 126.300 us | 5.56% | 122.268 us | 4.42% | 134 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 100000000 | 528x | 28.413 ms | 0.12% | 28.409 ms | 0.12% | 9448 | 7.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 100000000 | 752x | 131.420 us | 4.30% | 127.440 us | 2.96% | 128 | 450.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 100000000 | 528x | 28.398 ms | 0.12% | 28.395 ms | 0.12% | 9453 | 7.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 10 | 860x | 123.857 us | 5.40% | 119.900 us | 4.27% | 136 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 10 | 565x | 26.551 ms | 0.14% | 26.547 ms | 0.14% | 10111 | 7.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 10 | 706x | 131.193 us | 4.81% | 127.211 us | 3.62% | 128 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 10 | 499x | 30.091 ms | 0.31% | 30.087 ms | 0.31% | 8921 | 7.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 100 | 598x | 124.407 us | 4.10% | 120.433 us | 2.42% | 136 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 100 | 565x | 26.572 ms | 0.14% | 26.568 ms | 0.14% | 10103 | 7.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 100 | 778x | 130.659 us | 4.55% | 126.702 us | 3.34% | 129 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 100 | 498x | 30.109 ms | 0.25% | 30.106 ms | 0.24% | 8916 | 7.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 1000000 | 716x | 123.521 us | 5.27% | 119.557 us | 4.07% | 137 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 1000000 | 557x | 26.909 ms | 0.15% | 26.906 ms | 0.15% | 9976 | 7.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 1000000 | 802x | 132.551 us | 5.00% | 128.546 us | 3.89% | 127 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 1000000 | 486x | 30.902 ms | 0.18% | 30.898 ms | 0.18% | 8687 | 7.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 100000000 | 818x | 124.957 us | 5.54% | 120.927 us | 4.36% | 135 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 100000000 | 500x | 30.000 ms | 0.12% | 29.997 ms | 0.12% | 8948 | 7.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 100000000 | 810x | 131.756 us | 4.90% | 127.789 us | 3.78% | 128 | 450.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 100000000 | 411x | 36.558 ms | 0.15% | 36.555 ms | 0.15% | 7343 | 7.031 GiB |

## row_fixed_rolling_sum

### [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|------------------|----------------|----------------|-------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 404x | 35.776 us | 13.79% | 31.751 us | 4.87% | 516 | 258.016 KiB |
| I32 | 2^22 = 4194304 | 1 | 2 | 1 | 760x | 210.368 us | 2.49% | 206.365 us | 1.51% | 20324 | 64.500 MiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 1032x | 10.886 ms | 0.20% | 10.883 ms | 0.20% | 24666 | 4.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 382x | 35.926 us | 13.76% | 31.956 us | 5.91% | 512 | 258.016 KiB |
| I32 | 2^22 = 4194304 | 10 | 2 | 1 | 780x | 210.618 us | 2.52% | 206.654 us | 1.65% | 20296 | 64.500 MiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 938x | 10.876 ms | 0.12% | 10.872 ms | 0.12% | 24690 | 4.031 GiB |
| I32 | 2^14 = 16384 | 100 | 2 | 1 | 370x | 44.559 us | 10.75% | 40.611 us | 4.58% | 403 | 258.016 KiB |
| I32 | 2^22 = 4194304 | 100 | 2 | 1 | 782x | 297.286 us | 2.11% | 293.156 us | 1.54% | 14307 | 64.500 MiB |
| I32 | 2^28 = 268435456 | 100 | 2 | 1 | 876x | 17.113 ms | 0.42% | 17.109 ms | 0.42% | 15689 | 4.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 20 | 568x | 36.961 us | 14.10% | 32.786 us | 6.05% | 499 | 258.016 KiB |
| I32 | 2^22 = 4194304 | 1 | 2 | 20 | 808x | 210.503 us | 2.51% | 206.530 us | 1.61% | 20308 | 64.500 MiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 20 | 1124x | 10.886 ms | 0.16% | 10.883 ms | 0.16% | 24666 | 4.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 20 | 408x | 35.851 us | 12.88% | 31.927 us | 4.21% | 513 | 258.016 KiB |
| I32 | 2^22 = 4194304 | 10 | 2 | 20 | 756x | 210.040 us | 2.50% | 206.093 us | 1.58% | 20351 | 64.500 MiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 20 | 1004x | 10.876 ms | 0.13% | 10.872 ms | 0.12% | 24690 | 4.031 GiB |
| I32 | 2^14 = 16384 | 100 | 2 | 20 | 496x | 44.365 us | 10.65% | 40.361 us | 3.87% | 405 | 258.016 KiB |
| I32 | 2^22 = 4194304 | 100 | 2 | 20 | 694x | 296.938 us | 2.29% | 292.868 us | 1.80% | 14321 | 64.500 MiB |
| I32 | 2^28 = 268435456 | 100 | 2 | 20 | 878x | 17.079 ms | 0.47% | 17.075 ms | 0.47% | 15720 | 4.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 622x | 39.469 us | 16.54% | 35.335 us | 11.59% | 463 | 258.016 KiB |
| F64 | 2^22 = 4194304 | 1 | 2 | 1 | 784x | 234.374 us | 2.31% | 230.322 us | 1.42% | 18210 | 64.500 MiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 1060x | 12.464 ms | 0.09% | 12.460 ms | 0.09% | 21543 | 4.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 434x | 43.384 us | 13.04% | 39.365 us | 8.03% | 416 | 258.016 KiB |
| F64 | 2^22 = 4194304 | 10 | 2 | 1 | 648x | 287.190 us | 1.86% | 283.241 us | 1.25% | 14808 | 64.500 MiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 938x | 15.986 ms | 0.13% | 15.983 ms | 0.13% | 16795 | 4.031 GiB |
| F64 | 2^14 = 16384 | 100 | 2 | 1 | 1006x | 50.493 us | 11.98% | 46.514 us | 8.38% | 352 | 258.016 KiB |
| F64 | 2^22 = 4194304 | 100 | 2 | 1 | 678x | 1.743 ms | 0.28% | 1.739 ms | 0.16% | 2411 | 64.500 MiB |
| F64 | 2^28 = 268435456 | 100 | 2 | 1 | 139x | 108.511 ms | 0.06% | 108.509 ms | 0.06% | 2473 | 4.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 20 | 572x | 35.501 us | 14.15% | 31.566 us | 6.77% | 519 | 258.016 KiB |
| F64 | 2^22 = 4194304 | 1 | 2 | 20 | 792x | 234.279 us | 2.27% | 230.307 us | 1.47% | 18211 | 64.500 MiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 20 | 1112x | 12.466 ms | 0.09% | 12.462 ms | 0.08% | 21539 | 4.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 20 | 430x | 43.210 us | 11.79% | 39.277 us | 6.21% | 417 | 258.016 KiB |
| F64 | 2^22 = 4194304 | 10 | 2 | 20 | 690x | 285.352 us | 1.80% | 281.374 us | 1.09% | 14906 | 64.500 MiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 20 | 886x | 15.948 ms | 0.47% | 15.944 ms | 0.47% | 16835 | 4.031 GiB |
| F64 | 2^14 = 16384 | 100 | 2 | 20 | 618x | 50.056 us | 11.92% | 46.079 us | 8.03% | 355 | 258.016 KiB |
| F64 | 2^22 = 4194304 | 100 | 2 | 20 | 710x | 1.744 ms | 0.30% | 1.740 ms | 0.19% | 2411 | 64.500 MiB |
| F64 | 2^28 = 268435456 | 100 | 2 | 20 | 139x | 108.477 ms | 0.04% | 108.475 ms | 0.04% | 2474 | 4.031 GiB |

## row_variable_rolling_sum

### [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|------------------|----------------|----------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 | 2^14 = 16384 | 10 | 2 | 436x | 24.361 us | 26.44% | 20.407 us | 17.95% | 802 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 10 | 2 | 714x | 149.553 us | 5.30% | 145.353 us | 1.96% | 28856 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 974x | 7.854 ms | 0.17% | 7.850 ms | 0.17% | 34193 | 2.031 GiB |
| I32 | 2^14 = 16384 | 100 | 2 | 364x | 30.150 us | 16.63% | 26.130 us | 6.01% | 627 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 100 | 2 | 628x | 238.815 us | 3.44% | 234.618 us | 2.86% | 17877 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 100 | 2 | 624x | 14.628 ms | 0.79% | 14.624 ms | 0.79% | 18355 | 2.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 440x | 30.546 us | 16.87% | 26.256 us | 2.86% | 624 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 10 | 2 | 574x | 227.037 us | 2.68% | 223.001 us | 1.96% | 18808 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 930x | 12.900 ms | 0.58% | 12.896 ms | 0.58% | 20814 | 2.031 GiB |
| F64 | 2^14 = 16384 | 100 | 2 | 412x | 37.479 us | 13.85% | 33.438 us | 6.71% | 489 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 100 | 2 | 792x | 1.671 ms | 0.55% | 1.667 ms | 0.49% | 2516 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 100 | 2 | 144x | 104.675 ms | 0.05% | 104.673 ms | 0.05% | 2564 | 2.031 GiB |

</details>

<details>
<summary> After </summary>

# Benchmark Results

## row_grouped_rolling_sum

### [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | cardinality | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|------------------|----------------|----------------|-------------|-------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 10 | 610x | 116.114 us | 4.38% | 112.111 us | 2.55% | 146 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 10 | 875x | 17.128 ms | 0.22% | 17.124 ms | 0.22% | 15676 | 5.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 10 | 1038x | 113.661 us | 12.55% | 109.758 us | 12.03% | 149 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 10 | 874x | 17.147 ms | 0.19% | 17.143 ms | 0.19% | 15658 | 5.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 100 | 784x | 112.206 us | 5.86% | 108.231 us | 4.51% | 151 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 100 | 874x | 17.144 ms | 0.19% | 17.140 ms | 0.19% | 15661 | 5.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 100 | 742x | 114.504 us | 5.52% | 110.542 us | 4.21% | 148 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 100 | 872x | 17.189 ms | 0.19% | 17.185 ms | 0.19% | 15620 | 5.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 1000000 | 660x | 113.501 us | 5.42% | 109.536 us | 4.03% | 149 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 1000000 | 857x | 17.490 ms | 0.30% | 17.486 ms | 0.30% | 15351 | 5.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 1000000 | 596x | 115.135 us | 5.20% | 111.178 us | 3.78% | 147 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 1000000 | 851x | 17.608 ms | 0.18% | 17.604 ms | 0.18% | 15248 | 5.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 100000000 | 646x | 113.959 us | 4.91% | 109.990 us | 3.34% | 148 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 100000000 | 734x | 20.436 ms | 0.19% | 20.432 ms | 0.18% | 13138 | 5.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 100000000 | 418x | 116.388 us | 4.26% | 112.387 us | 2.31% | 145 | 322.031 KiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 100000000 | 721x | 20.796 ms | 0.15% | 20.792 ms | 0.15% | 12910 | 5.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 10 | 744x | 113.892 us | 5.18% | 109.941 us | 3.75% | 149 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 10 | 797x | 18.798 ms | 0.17% | 18.794 ms | 0.17% | 14283 | 5.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 10 | 694x | 114.918 us | 5.14% | 110.968 us | 3.69% | 147 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 10 | 626x | 23.972 ms | 0.41% | 23.968 ms | 0.41% | 11199 | 5.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 100 | 754x | 114.013 us | 5.51% | 110.094 us | 4.22% | 148 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 100 | 797x | 18.809 ms | 0.19% | 18.805 ms | 0.18% | 14274 | 5.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 100 | 624x | 114.799 us | 5.12% | 110.850 us | 3.64% | 147 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 100 | 624x | 24.035 ms | 0.53% | 24.031 ms | 0.53% | 11170 | 5.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 1000000 | 588x | 114.699 us | 5.00% | 110.743 us | 3.46% | 147 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 1000000 | 783x | 19.138 ms | 0.17% | 19.134 ms | 0.17% | 14029 | 5.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 1000000 | 560x | 114.636 us | 5.26% | 110.662 us | 3.82% | 148 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 1000000 | 606x | 24.757 ms | 0.45% | 24.753 ms | 0.45% | 10844 | 5.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 100000000 | 780x | 113.344 us | 5.66% | 109.376 us | 4.32% | 149 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 100000000 | 682x | 21.974 ms | 0.15% | 21.970 ms | 0.15% | 12218 | 5.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 100000000 | 568x | 115.521 us | 4.91% | 111.565 us | 3.34% | 146 | 322.031 KiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 100000000 | 502x | 29.905 ms | 0.19% | 29.901 ms | 0.19% | 8977 | 5.031 GiB |

## row_fixed_rolling_sum

### [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|------------------|----------------|----------------|-------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 | 2^14 = 16384 | 1 | 2 | 1 | 474x | 21.860 us | 26.56% | 17.797 us | 12.02% | 920 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 1 | 2 | 1 | 728x | 100.117 us | 5.62% | 96.093 us | 3.38% | 43648 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 1 | 824x | 4.772 ms | 0.21% | 4.768 ms | 0.19% | 56303 | 2.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 1 | 424x | 21.741 us | 26.94% | 17.721 us | 14.31% | 924 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 10 | 2 | 1 | 660x | 99.712 us | 5.91% | 95.725 us | 4.13% | 43816 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1 | 962x | 4.763 ms | 0.28% | 4.759 ms | 0.26% | 56409 | 2.031 GiB |
| I32 | 2^14 = 16384 | 100 | 2 | 1 | 592x | 30.530 us | 19.38% | 26.492 us | 11.86% | 618 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 100 | 2 | 1 | 558x | 238.409 us | 2.91% | 234.292 us | 2.29% | 17902 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 100 | 2 | 1 | 814x | 14.617 ms | 0.47% | 14.612 ms | 0.47% | 18370 | 2.031 GiB |
| I32 | 2^14 = 16384 | 1 | 2 | 20 | 474x | 22.231 us | 24.96% | 17.994 us | 7.92% | 910 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 1 | 2 | 20 | 776x | 99.747 us | 4.85% | 95.737 us | 2.39% | 43810 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 1 | 2 | 20 | 866x | 4.771 ms | 0.11% | 4.767 ms | 0.08% | 56317 | 2.031 GiB |
| I32 | 2^14 = 16384 | 10 | 2 | 20 | 404x | 21.898 us | 25.01% | 17.827 us | 9.79% | 919 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 10 | 2 | 20 | 692x | 99.264 us | 4.72% | 95.282 us | 2.15% | 44019 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 20 | 706x | 4.762 ms | 0.29% | 4.758 ms | 0.28% | 56419 | 2.031 GiB |
| I32 | 2^14 = 16384 | 100 | 2 | 20 | 584x | 30.366 us | 15.81% | 26.306 us | 3.31% | 622 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 100 | 2 | 20 | 540x | 237.905 us | 2.62% | 233.802 us | 1.91% | 17939 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 100 | 2 | 20 | 616x | 14.618 ms | 0.50% | 14.614 ms | 0.49% | 18368 | 2.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 1 | 414x | 23.969 us | 28.36% | 19.710 us | 17.67% | 831 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 1 | 2 | 1 | 716x | 125.282 us | 4.01% | 121.248 us | 2.18% | 34592 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 1 | 1082x | 6.315 ms | 0.22% | 6.311 ms | 0.21% | 42537 | 2.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 1 | 458x | 29.235 us | 19.23% | 25.181 us | 10.50% | 650 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 10 | 2 | 1 | 668x | 229.256 us | 2.39% | 225.265 us | 1.60% | 18619 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1 | 1145x | 13.088 ms | 0.94% | 13.084 ms | 0.94% | 20516 | 2.031 GiB |
| F64 | 2^14 = 16384 | 100 | 2 | 1 | 470x | 36.907 us | 14.33% | 32.953 us | 8.02% | 497 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 100 | 2 | 1 | 704x | 1.699 ms | 1.58% | 1.695 ms | 1.56% | 2473 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 100 | 2 | 1 | 143x | 105.210 ms | 0.37% | 105.206 ms | 0.37% | 2551 | 2.031 GiB |
| F64 | 2^14 = 16384 | 1 | 2 | 20 | 416x | 21.304 us | 24.50% | 17.376 us | 9.38% | 942 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 1 | 2 | 20 | 584x | 124.586 us | 3.85% | 120.627 us | 1.91% | 34770 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 1 | 2 | 20 | 998x | 6.318 ms | 0.23% | 6.314 ms | 0.22% | 42516 | 2.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 20 | 632x | 28.582 us | 20.04% | 24.602 us | 11.91% | 665 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 10 | 2 | 20 | 664x | 222.036 us | 2.17% | 218.110 us | 1.20% | 19230 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 20 | 1000x | 12.880 ms | 0.76% | 12.876 ms | 0.76% | 20847 | 2.031 GiB |
| F64 | 2^14 = 16384 | 100 | 2 | 20 | 356x | 33.704 us | 17.99% | 29.671 us | 11.54% | 552 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 100 | 2 | 20 | 528x | 1.668 ms | 0.47% | 1.664 ms | 0.40% | 2520 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 100 | 2 | 20 | 143x | 104.902 ms | 0.04% | 104.898 ms | 0.04% | 2559 | 2.031 GiB |

## row_variable_rolling_sum

### [0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|------------------|----------------|----------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 | 2^14 = 16384 | 10 | 2 | 392x | 21.847 us | 24.65% | 17.969 us | 12.83% | 911 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 10 | 2 | 786x | 149.751 us | 3.19% | 145.831 us | 1.77% | 28761 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 10 | 2 | 1186x | 7.867 ms | 0.21% | 7.863 ms | 0.20% | 34140 | 2.031 GiB |
| I32 | 2^14 = 16384 | 100 | 2 | 406x | 29.974 us | 16.38% | 26.012 us | 5.87% | 629 | 130.016 KiB |
| I32 | 2^22 = 4194304 | 100 | 2 | 588x | 246.351 us | 3.57% | 242.230 us | 3.12% | 17315 | 32.500 MiB |
| I32 | 2^28 = 268435456 | 100 | 2 | 612x | 15.222 ms | 0.75% | 15.218 ms | 0.75% | 17639 | 2.031 GiB |
| F64 | 2^14 = 16384 | 10 | 2 | 410x | 30.346 us | 16.77% | 26.087 us | 3.69% | 628 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 10 | 2 | 632x | 226.719 us | 2.91% | 222.736 us | 2.26% | 18830 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 10 | 2 | 1058x | 12.882 ms | 0.82% | 12.877 ms | 0.82% | 20845 | 2.031 GiB |
| F64 | 2^14 = 16384 | 100 | 2 | 414x | 38.118 us | 12.55% | 34.123 us | 4.52% | 480 | 130.016 KiB |
| F64 | 2^22 = 4194304 | 100 | 2 | 658x | 1.677 ms | 0.33% | 1.673 ms | 0.24% | 2507 | 32.500 MiB |
| F64 | 2^28 = 268435456 | 100 | 2 | 143x | 104.991 ms | 0.04% | 104.987 ms | 0.04% | 2556 | 2.031 GiB |

</details>

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- MithunR (https://github.com/mythrocks)
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Shruti Shivakumar (https://github.com/shrshi)

URL: #17623
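The `%Diff` and `Status` columns in the nvbench compare output can be read with a small sketch. This is an illustration only, not the code of `nvbench_compare.py`. It assumes the rule stated in the Summary (pass when the relative difference is within the smaller of the two noise values, unknown when the noise is infinite), and the names `compare_status`, `comparison`, and `compare_times` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical names; they mirror the PASS / FAIL / Unknown buckets in the Summary.
enum class compare_status { pass, fail, unknown };

struct comparison {
  double diff;       // cmp_time - ref_time, in the same unit as the inputs
  double frac_diff;  // diff / ref_time; multiply by 100 for the %Diff column
  compare_status result;
};

// Noise values are fractions, e.g. 0.0013 for 0.13%.
inline comparison compare_times(double ref_time,
                                double ref_noise,
                                double cmp_time,
                                double cmp_noise)
{
  assert(ref_time > 0.0);
  double const diff      = cmp_time - ref_time;
  double const frac_diff = diff / ref_time;
  double const min_noise = std::min(ref_noise, cmp_noise);

  // Assumption: without a finite noise estimate the row cannot be classified.
  if (!std::isfinite(min_noise)) { return {diff, frac_diff, compare_status::unknown}; }

  // "Pass (diff <= min_noise)": the change is within measurement noise.
  auto const result =
    std::abs(frac_diff) <= min_noise ? compare_status::pass : compare_status::fail;
  return {diff, frac_diff, result};
}
```

Under this reading, FAIL only means that the change exceeds the noise floor, in either direction. The I32, 2^28, cardinality 10 grouped row (24.945 ms to 17.126 ms) is a FAIL with a negative `%Diff`, which is a speed-up, while the I32, 2^22, preceding 10 variable row (145.353 us to 145.458 us) is a PASS because its 0.07% difference is below the 1.82% noise.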
Description
Previously we did not have any benchmarks for rolling aggregations. Introduce some, so we can measure the effects of any performance improvements we might make.
Checklist