Performance improvements and simplifications for fixed size row-based rolling windows #17623

Merged 23 commits on Jan 23, 2025

@wence- (Contributor) commented Dec 18, 2024

Description

Having introduced benchmarks of rolling window performance in #17613, we can now look at what happens when cleaning up some of the implementation.

Breaking change

Previously, the most general "variable-sized" rolling window function applied a transform iterator (sometimes multiple times) to the user-provided preceding and following columns. We now require, and document, that the user provides values that are always in bounds.

This simplifies the book-keeping, and allows us to remove some (now redundant) checks from the rolling kernel.
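
To make the new contract concrete, here is a minimal host-side sketch of a check a caller could run on its preceding and following columns. The name windows_in_bounds and the use of std::vector are illustrative assumptions, not part of libcudf. The bounds follow the kernel's convention that the window for row i spans [i - preceding + 1, i + following + 1), and that the window may be empty but never reversed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using size_type = std::int32_t;  // stand-in for cudf::size_type

// Hypothetical helper (not a libcudf function): returns true if every
// per-row window stays inside [0, num_rows] and is not reversed.
inline bool windows_in_bounds(std::vector<size_type> const& preceding,
                              std::vector<size_type> const& following)
{
  if (preceding.size() != following.size()) { return false; }
  auto const num_rows = static_cast<std::int64_t>(preceding.size());
  for (std::size_t row = 0; row < preceding.size(); ++row) {
    // Use 64-bit arithmetic so the check itself cannot overflow.
    auto const i     = static_cast<std::int64_t>(row);
    auto const start = i - preceding[row] + 1;
    auto const end   = i + following[row] + 1;
    if (start < 0 || start > num_rows) { return false; }
    if (end < 0 || end > num_rows) { return false; }
    if (start > end) { return false; }
  }
  return true;
}
```

With five rows and all-zero following offsets, the preceding values {1, 1, 2, 1, 0} pass this check, while {2, 1, 2, 1, 4} fail at row 0 because the window would start at index -1.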

Performance improvements

In various places, we accept fixed-size integers as row-based offsets for rolling windows. We used to materialise these into columns holding the clamped preceding and following offsets. However, we did so inconsistently with respect to which side we would clamp (after the change to allow negative window offsets).

To fix this, we rationalise the clamping in one place and just use a transform iterator. We now do this consistently for both grouped and ungrouped fixed-size windows.
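
As a rough illustration of clamping on the fly, the host-only sketch below computes the clamped offset for each row when it is needed, which is the role a transform iterator plays on the device. All names here (row_range, clamped_preceding, clamped_following, window_sizes) are hypothetical and are not the utilities added by this PR. The sketch assumes a half-open row range that is the whole column for ungrouped windows or a single group for grouped windows, and that preceding + following >= 0 so windows are never reversed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using size_type = std::int32_t;  // stand-in for cudf::size_type

// Hypothetical half-open row range [begin, end): the whole column for
// ungrouped windows, or one group's rows for grouped windows.
struct row_range {
  size_type begin;
  size_type end;
};

// Clamp a fixed preceding offset so that i - p + 1 lies in [begin, end].
// The preceding offset counts the current row.
inline size_type clamped_preceding(size_type p, size_type i, row_range r)
{
  return std::min(i + 1 - r.begin, std::max(p, i + 1 - r.end));
}

// Clamp a fixed following offset so that i + f + 1 lies in [begin, end].
inline size_type clamped_following(size_type f, size_type i, row_range r)
{
  return std::max(r.begin - i - 1, std::min(f, r.end - i - 1));
}

// Window length for every row in the range, computing the clamped offsets
// per row instead of storing them in two extra columns.
inline std::vector<size_type> window_sizes(size_type p, size_type f, row_range r)
{
  std::vector<size_type> sizes;
  for (size_type i = r.begin; i < r.end; ++i) {
    auto const start = i - clamped_preceding(p, i, r) + 1;
    auto const end   = i + clamped_following(f, i, r) + 1;
    assert(r.begin <= start && start <= end && end <= r.end);
    sizes.push_back(end - start);
  }
  return sizes;
}
```

For a five-row ungrouped column with preceding 2 and following 1, the interior rows get three-element windows and the two edge rows get two-element windows. The same functions applied to a group occupying rows 5 to 7 keep every window inside that group, however large the requested offsets are.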

Benchmark results

Comparing da4accc (the tip of #17613) and this branch:

nvbench compare output

Headline numbers:

  • group-aware fixed rolling windows between 10 and 30% faster, use 30% less memory
  • ungrouped fixed rolling windows between 10 and 50% faster, use 50% less memory
  • ungrouped variable rolling windows basically no change (so no regression)
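
The memory numbers can be cross-checked against the peak_memory_usage columns in the Before and After tables. At 2^28 rows both fixed-size benchmarks drop by 2.000 GiB (7.031 to 5.031 GiB grouped, 4.031 to 2.031 GiB ungrouped), which is the size of two offset columns if each offset is a 32-bit integer, as cudf::size_type is. The helper names below are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to materialise `num_columns` columns of 32-bit offsets.
constexpr std::int64_t materialised_bytes(std::int64_t num_rows, int num_columns)
{
  return num_rows * num_columns * static_cast<std::int64_t>(sizeof(std::int32_t));
}

// Relative reduction, in percent, going from `before` to `after`.
constexpr double percent_reduction(double before, double after)
{
  return 100.0 * (before - after) / before;
}

// Two offset columns (preceding and following) at 2^28 rows occupy 2 GiB.
static_assert(materialised_bytes(std::int64_t{1} << 28, 2) == (std::int64_t{2} << 30),
              "two 32-bit offset columns at 2^28 rows are 2 GiB");
```

Applying percent_reduction to the table values gives roughly 28% for the grouped case and roughly 50% for the ungrouped case, in line with the headline figures.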
All the data

latest/_deps/nvbench-src/scripts/nvbench_compare.py da4accc.json c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json
['da4acccf6b76a93b1e9395b51950a17c45da99c3.json', 'c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json']

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | cardinality | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 2^14 | 1 | 2 | 1 | 10 | 123.556 us | 3.90% | 111.199 us | 4.09% | -12.357 us | -10.00% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 10 | 24.945 ms | 0.13% | 17.126 ms | 0.26% | -7819.548 us | -31.35% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 10 | 122.737 us | 4.09% | 109.055 us | 4.40% | -13.682 us | -11.15% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 10 | 24.944 ms | 0.18% | 17.151 ms | 0.20% | -7793.093 us | -31.24% | FAIL |
| I32 | 2^14 | 1 | 2 | 1 | 100 | 119.947 us | 4.07% | 108.432 us | 5.18% | -11.515 us | -9.60% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 100 | 24.954 ms | 0.15% | 17.149 ms | 0.20% | -7805.558 us | -31.28% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 100 | 123.919 us | 3.85% | 109.812 us | 4.63% | -14.107 us | -11.38% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 100 | 24.955 ms | 0.14% | 17.173 ms | 0.18% | -7782.547 us | -31.19% | FAIL |
| I32 | 2^14 | 1 | 2 | 1 | 1000000 | 121.896 us | 4.59% | 109.115 us | 4.57% | -12.781 us | -10.49% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 1000000 | 25.291 ms | 0.15% | 17.460 ms | 0.21% | -7830.713 us | -30.96% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 1000000 | 126.450 us | 3.65% | 111.787 us | 3.37% | -14.662 us | -11.60% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 1000000 | 25.295 ms | 0.13% | 17.593 ms | 0.19% | -7701.923 us | -30.45% | FAIL |
| I32 | 2^14 | 1 | 2 | 1 | 100000000 | 122.268 us | 4.42% | 109.338 us | 5.17% | -12.930 us | -10.57% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 100000000 | 28.409 ms | 0.12% | 20.422 ms | 0.16% | -7987.039 us | -28.11% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 100000000 | 127.440 us | 2.96% | 113.009 us | 4.54% | -14.431 us | -11.32% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 100000000 | 28.395 ms | 0.12% | 20.789 ms | 0.17% | -7606.146 us | -26.79% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 10 | 119.900 us | 4.27% | 110.024 us | 5.21% | -9.876 us | -8.24% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 10 | 26.547 ms | 0.14% | 18.790 ms | 0.19% | -7756.825 us | -29.22% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 10 | 127.211 us | 3.62% | 112.724 us | 4.63% | -14.487 us | -11.39% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 10 | 30.087 ms | 0.31% | 23.886 ms | 0.19% | -6201.201 us | -20.61% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 100 | 120.433 us | 2.42% | 109.777 us | 4.76% | -10.656 us | -8.85% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 100 | 26.568 ms | 0.14% | 18.797 ms | 0.18% | -7770.905 us | -29.25% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 100 | 126.702 us | 3.34% | 112.492 us | 4.27% | -14.210 us | -11.22% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 100 | 30.106 ms | 0.24% | 24.001 ms | 0.46% | -6105.382 us | -20.28% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 1000000 | 119.557 us | 4.07% | 110.355 us | 5.38% | -9.203 us | -7.70% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 1000000 | 26.906 ms | 0.15% | 19.126 ms | 0.18% | -7779.215 us | -28.91% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 1000000 | 128.546 us | 3.89% | 113.429 us | 2.36% | -15.117 us | -11.76% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 1000000 | 30.898 ms | 0.18% | 24.737 ms | 0.45% | -6161.068 us | -19.94% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 100000000 | 120.927 us | 4.36% | 109.903 us | 4.67% | -11.024 us | -9.12% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 100000000 | 29.997 ms | 0.12% | 21.968 ms | 0.16% | -8028.453 us | -26.76% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 100000000 | 127.789 us | 3.78% | 112.861 us | 3.81% | -14.928 us | -11.68% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 100000000 | 36.555 ms | 0.15% | 29.882 ms | 0.12% | -6672.173 us | -18.25% | FAIL |

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | min_periods | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 2^14 | 1 | 2 | 1 | 31.751 us | 4.87% | 17.603 us | 6.85% | -14.148 us | -44.56% | FAIL |
| I32 | 2^22 | 1 | 2 | 1 | 206.365 us | 1.51% | 95.830 us | 2.39% | -110.534 us | -53.56% | FAIL |
| I32 | 2^28 | 1 | 2 | 1 | 10.883 ms | 0.20% | 4.767 ms | 0.25% | -6115.576 us | -56.20% | FAIL |
| I32 | 2^14 | 10 | 2 | 1 | 31.956 us | 5.91% | 17.559 us | 7.14% | -14.397 us | -45.05% | FAIL |
| I32 | 2^22 | 10 | 2 | 1 | 206.654 us | 1.65% | 95.575 us | 2.29% | -111.079 us | -53.75% | FAIL |
| I32 | 2^28 | 10 | 2 | 1 | 10.872 ms | 0.12% | 4.758 ms | 0.29% | -6114.139 us | -56.24% | FAIL |
| I32 | 2^14 | 100 | 2 | 1 | 40.611 us | 4.58% | 26.636 us | 9.61% | -13.976 us | -34.41% | FAIL |
| I32 | 2^22 | 100 | 2 | 1 | 293.156 us | 1.54% | 235.445 us | 2.23% | -57.710 us | -19.69% | FAIL |
| I32 | 2^28 | 100 | 2 | 1 | 17.109 ms | 0.42% | 14.613 ms | 0.48% | -2496.488 us | -14.59% | FAIL |
| I32 | 2^14 | 1 | 2 | 20 | 32.786 us | 6.05% | 18.115 us | 10.13% | -14.671 us | -44.75% | FAIL |
| I32 | 2^22 | 1 | 2 | 20 | 206.530 us | 1.61% | 96.530 us | 2.79% | -109.999 us | -53.26% | FAIL |
| I32 | 2^28 | 1 | 2 | 20 | 10.883 ms | 0.16% | 4.767 ms | 0.27% | -6115.283 us | -56.19% | FAIL |
| I32 | 2^14 | 10 | 2 | 20 | 31.927 us | 4.21% | 17.784 us | 15.89% | -14.144 us | -44.30% | FAIL |
| I32 | 2^22 | 10 | 2 | 20 | 206.093 us | 1.58% | 95.657 us | 2.92% | -110.436 us | -53.59% | FAIL |
| I32 | 2^28 | 10 | 2 | 20 | 10.872 ms | 0.12% | 4.757 ms | 0.06% | -6115.435 us | -56.25% | FAIL |
| I32 | 2^14 | 100 | 2 | 20 | 40.361 us | 3.87% | 26.359 us | 3.42% | -14.002 us | -34.69% | FAIL |
| I32 | 2^22 | 100 | 2 | 20 | 292.868 us | 1.80% | 235.340 us | 2.69% | -57.528 us | -19.64% | FAIL |
| I32 | 2^28 | 100 | 2 | 20 | 17.075 ms | 0.47% | 14.613 ms | 0.53% | -2462.145 us | -14.42% | FAIL |
| F64 | 2^14 | 1 | 2 | 1 | 35.335 us | 11.59% | 18.771 us | 13.65% | -16.564 us | -46.88% | FAIL |
| F64 | 2^22 | 1 | 2 | 1 | 230.322 us | 1.42% | 121.412 us | 3.23% | -108.910 us | -47.29% | FAIL |
| F64 | 2^28 | 1 | 2 | 1 | 12.460 ms | 0.09% | 6.311 ms | 0.11% | -6149.789 us | -49.35% | FAIL |
| F64 | 2^14 | 10 | 2 | 1 | 39.365 us | 8.03% | 24.712 us | 14.97% | -14.653 us | -37.22% | FAIL |
| F64 | 2^22 | 10 | 2 | 1 | 283.241 us | 1.25% | 224.296 us | 1.86% | -58.945 us | -20.81% | FAIL |
| F64 | 2^28 | 10 | 2 | 1 | 15.983 ms | 0.13% | 13.039 ms | 0.86% | -2943.223 us | -18.42% | FAIL |
| F64 | 2^14 | 100 | 2 | 1 | 46.514 us | 8.38% | 28.323 us | 10.77% | -18.191 us | -39.11% | FAIL |
| F64 | 2^22 | 100 | 2 | 1 | 1.739 ms | 0.16% | 1.668 ms | 0.53% | -70.890 us | -4.08% | FAIL |
| F64 | 2^28 | 100 | 2 | 1 | 108.509 ms | 0.06% | 104.891 ms | 0.04% | -3618.662 us | -3.33% | FAIL |
| F64 | 2^14 | 1 | 2 | 20 | 31.566 us | 6.77% | 17.406 us | 8.84% | -14.160 us | -44.86% | FAIL |
| F64 | 2^22 | 1 | 2 | 20 | 230.307 us | 1.47% | 121.044 us | 1.90% | -109.263 us | -47.44% | FAIL |
| F64 | 2^28 | 1 | 2 | 20 | 12.462 ms | 0.08% | 6.315 ms | 0.21% | -6147.508 us | -49.33% | FAIL |
| F64 | 2^14 | 10 | 2 | 20 | 39.277 us | 6.21% | 23.033 us | 16.63% | -16.244 us | -41.36% | FAIL |
| F64 | 2^22 | 10 | 2 | 20 | 281.374 us | 1.09% | 219.307 us | 1.12% | -62.067 us | -22.06% | FAIL |
| F64 | 2^28 | 10 | 2 | 20 | 15.944 ms | 0.47% | 12.882 ms | 0.72% | -3062.193 us | -19.21% | FAIL |
| F64 | 2^14 | 100 | 2 | 20 | 46.079 us | 8.03% | 28.287 us | 11.55% | -17.792 us | -38.61% | FAIL |
| F64 | 2^22 | 100 | 2 | 20 | 1.740 ms | 0.19% | 1.664 ms | 0.36% | -75.602 us | -4.35% | FAIL |
| F64 | 2^28 | 100 | 2 | 20 | 108.475 ms | 0.04% | 104.920 ms | 0.04% | -3555.241 us | -3.28% | FAIL |

row_variable_rolling_sum

[0] NVIDIA RTX A6000

| T | num_rows | preceding_size | following_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 2^14 | 10 | 2 | 20.407 us | 17.95% | 17.981 us | 9.19% | -2.427 us | -11.89% | FAIL |
| I32 | 2^22 | 10 | 2 | 145.353 us | 1.96% | 145.458 us | 1.82% | 0.105 us | 0.07% | PASS |
| I32 | 2^28 | 10 | 2 | 7.850 ms | 0.17% | 7.863 ms | 0.21% | 12.240 us | 0.16% | PASS |
| I32 | 2^14 | 100 | 2 | 26.130 us | 6.01% | 26.324 us | 3.86% | 0.194 us | 0.74% | PASS |
| I32 | 2^22 | 100 | 2 | 234.618 us | 2.86% | 245.878 us | 3.30% | 11.260 us | 4.80% | FAIL |
| I32 | 2^28 | 100 | 2 | 14.624 ms | 0.79% | 15.221 ms | 0.68% | 596.983 us | 4.08% | FAIL |
| F64 | 2^14 | 10 | 2 | 26.256 us | 2.86% | 26.508 us | 6.46% | 0.252 us | 0.96% | PASS |
| F64 | 2^22 | 10 | 2 | 223.001 us | 1.96% | 222.131 us | 1.85% | -0.870 us | -0.39% | PASS |
| F64 | 2^28 | 10 | 2 | 12.896 ms | 0.58% | 12.893 ms | 0.78% | -3.356 us | -0.03% | PASS |
| F64 | 2^14 | 100 | 2 | 33.438 us | 6.71% | 34.323 us | 1.75% | 0.885 us | 2.65% | FAIL |
| F64 | 2^22 | 100 | 2 | 1.667 ms | 0.49% | 1.672 ms | 0.26% | 5.138 us | 0.31% | FAIL |
| F64 | 2^28 | 100 | 2 | 104.673 ms | 0.05% | 104.982 ms | 0.04% | 309.519 us | 0.30% | FAIL |

Summary

  • Total Matches: 80
    • Pass (diff <= min_noise): 6
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 74
Before

Benchmark Results

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

T num_rows preceding_size following_size min_periods cardinality Samples CPU Time Noise GPU Time Noise Mrows/s peak_memory_usage
I32 2^14 = 16384 1 2 1 10 840x 127.585 us 5.09% 123.556 us 3.90% 132 450.031 KiB
I32 2^28 = 268435456 1 2 1 10 601x 24.948 ms 0.13% 24.945 ms 0.13% 10761 7.031 GiB
I32 2^14 = 16384 10 2 1 10 1020x 126.638 us 5.18% 122.737 us 4.09% 133 450.031 KiB
I32 2^28 = 268435456 10 2 1 10 601x 24.947 ms 0.18% 24.944 ms 0.18% 10761 7.031 GiB
I32 2^14 = 16384 1 2 1 100 630x 123.913 us 5.30% 119.947 us 4.07% 136 450.031 KiB
I32 2^28 = 268435456 1 2 1 100 601x 24.958 ms 0.15% 24.954 ms 0.15% 10757 7.031 GiB
I32 2^14 = 16384 10 2 1 100 984x 127.839 us 4.98% 123.919 us 3.85% 132 450.031 KiB
I32 2^28 = 268435456 10 2 1 100 601x 24.958 ms 0.14% 24.955 ms 0.14% 10756 7.031 GiB
I32 2^14 = 16384 1 2 1 1000000 802x 125.851 us 5.65% 121.896 us 4.59% 134 450.031 KiB
I32 2^28 = 268435456 1 2 1 1000000 593x 25.294 ms 0.15% 25.291 ms 0.15% 10613 7.031 GiB
I32 2^14 = 16384 10 2 1 1000000 992x 130.416 us 4.81% 126.450 us 3.65% 129 450.031 KiB
I32 2^28 = 268435456 10 2 1 1000000 593x 25.298 ms 0.14% 25.295 ms 0.13% 10612 7.031 GiB
I32 2^14 = 16384 1 2 1 100000000 820x 126.300 us 5.56% 122.268 us 4.42% 134 450.031 KiB
I32 2^28 = 268435456 1 2 1 100000000 528x 28.413 ms 0.12% 28.409 ms 0.12% 9448 7.031 GiB
I32 2^14 = 16384 10 2 1 100000000 752x 131.420 us 4.30% 127.440 us 2.96% 128 450.031 KiB
I32 2^28 = 268435456 10 2 1 100000000 528x 28.398 ms 0.12% 28.395 ms 0.12% 9453 7.031 GiB
F64 2^14 = 16384 1 2 1 10 860x 123.857 us 5.40% 119.900 us 4.27% 136 450.031 KiB
F64 2^28 = 268435456 1 2 1 10 565x 26.551 ms 0.14% 26.547 ms 0.14% 10111 7.031 GiB
F64 2^14 = 16384 10 2 1 10 706x 131.193 us 4.81% 127.211 us 3.62% 128 450.031 KiB
F64 2^28 = 268435456 10 2 1 10 499x 30.091 ms 0.31% 30.087 ms 0.31% 8921 7.031 GiB
F64 2^14 = 16384 1 2 1 100 598x 124.407 us 4.10% 120.433 us 2.42% 136 450.031 KiB
F64 2^28 = 268435456 1 2 1 100 565x 26.572 ms 0.14% 26.568 ms 0.14% 10103 7.031 GiB
F64 2^14 = 16384 10 2 1 100 778x 130.659 us 4.55% 126.702 us 3.34% 129 450.031 KiB
F64 2^28 = 268435456 10 2 1 100 498x 30.109 ms 0.25% 30.106 ms 0.24% 8916 7.031 GiB
F64 2^14 = 16384 1 2 1 1000000 716x 123.521 us 5.27% 119.557 us 4.07% 137 450.031 KiB
F64 2^28 = 268435456 1 2 1 1000000 557x 26.909 ms 0.15% 26.906 ms 0.15% 9976 7.031 GiB
F64 2^14 = 16384 10 2 1 1000000 802x 132.551 us 5.00% 128.546 us 3.89% 127 450.031 KiB
F64 2^28 = 268435456 10 2 1 1000000 486x 30.902 ms 0.18% 30.898 ms 0.18% 8687 7.031 GiB
F64 2^14 = 16384 1 2 1 100000000 818x 124.957 us 5.54% 120.927 us 4.36% 135 450.031 KiB
F64 2^28 = 268435456 1 2 1 100000000 500x 30.000 ms 0.12% 29.997 ms 0.12% 8948 7.031 GiB
F64 2^14 = 16384 10 2 1 100000000 810x 131.756 us 4.90% 127.789 us 3.78% 128 450.031 KiB
F64 2^28 = 268435456 10 2 1 100000000 411x 36.558 ms 0.15% 36.555 ms 0.15% 7343 7.031 GiB

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

T num_rows preceding_size following_size min_periods Samples CPU Time Noise GPU Time Noise Mrows/s peak_memory_usage
I32 2^14 = 16384 1 2 1 404x 35.776 us 13.79% 31.751 us 4.87% 516 258.016 KiB
I32 2^22 = 4194304 1 2 1 760x 210.368 us 2.49% 206.365 us 1.51% 20324 64.500 MiB
I32 2^28 = 268435456 1 2 1 1032x 10.886 ms 0.20% 10.883 ms 0.20% 24666 4.031 GiB
I32 2^14 = 16384 10 2 1 382x 35.926 us 13.76% 31.956 us 5.91% 512 258.016 KiB
I32 2^22 = 4194304 10 2 1 780x 210.618 us 2.52% 206.654 us 1.65% 20296 64.500 MiB
I32 2^28 = 268435456 10 2 1 938x 10.876 ms 0.12% 10.872 ms 0.12% 24690 4.031 GiB
I32 2^14 = 16384 100 2 1 370x 44.559 us 10.75% 40.611 us 4.58% 403 258.016 KiB
I32 2^22 = 4194304 100 2 1 782x 297.286 us 2.11% 293.156 us 1.54% 14307 64.500 MiB
I32 2^28 = 268435456 100 2 1 876x 17.113 ms 0.42% 17.109 ms 0.42% 15689 4.031 GiB
I32 2^14 = 16384 1 2 20 568x 36.961 us 14.10% 32.786 us 6.05% 499 258.016 KiB
I32 2^22 = 4194304 1 2 20 808x 210.503 us 2.51% 206.530 us 1.61% 20308 64.500 MiB
I32 2^28 = 268435456 1 2 20 1124x 10.886 ms 0.16% 10.883 ms 0.16% 24666 4.031 GiB
I32 2^14 = 16384 10 2 20 408x 35.851 us 12.88% 31.927 us 4.21% 513 258.016 KiB
I32 2^22 = 4194304 10 2 20 756x 210.040 us 2.50% 206.093 us 1.58% 20351 64.500 MiB
I32 2^28 = 268435456 10 2 20 1004x 10.876 ms 0.13% 10.872 ms 0.12% 24690 4.031 GiB
I32 2^14 = 16384 100 2 20 496x 44.365 us 10.65% 40.361 us 3.87% 405 258.016 KiB
I32 2^22 = 4194304 100 2 20 694x 296.938 us 2.29% 292.868 us 1.80% 14321 64.500 MiB
I32 2^28 = 268435456 100 2 20 878x 17.079 ms 0.47% 17.075 ms 0.47% 15720 4.031 GiB
F64 2^14 = 16384 1 2 1 622x 39.469 us 16.54% 35.335 us 11.59% 463 258.016 KiB
F64 2^22 = 4194304 1 2 1 784x 234.374 us 2.31% 230.322 us 1.42% 18210 64.500 MiB
F64 2^28 = 268435456 1 2 1 1060x 12.464 ms 0.09% 12.460 ms 0.09% 21543 4.031 GiB
F64 2^14 = 16384 10 2 1 434x 43.384 us 13.04% 39.365 us 8.03% 416 258.016 KiB
F64 2^22 = 4194304 10 2 1 648x 287.190 us 1.86% 283.241 us 1.25% 14808 64.500 MiB
F64 2^28 = 268435456 10 2 1 938x 15.986 ms 0.13% 15.983 ms 0.13% 16795 4.031 GiB
F64 2^14 = 16384 100 2 1 1006x 50.493 us 11.98% 46.514 us 8.38% 352 258.016 KiB
F64 2^22 = 4194304 100 2 1 678x 1.743 ms 0.28% 1.739 ms 0.16% 2411 64.500 MiB
F64 2^28 = 268435456 100 2 1 139x 108.511 ms 0.06% 108.509 ms 0.06% 2473 4.031 GiB
F64 2^14 = 16384 1 2 20 572x 35.501 us 14.15% 31.566 us 6.77% 519 258.016 KiB
F64 2^22 = 4194304 1 2 20 792x 234.279 us 2.27% 230.307 us 1.47% 18211 64.500 MiB
F64 2^28 = 268435456 1 2 20 1112x 12.466 ms 0.09% 12.462 ms 0.08% 21539 4.031 GiB
F64 2^14 = 16384 10 2 20 430x 43.210 us 11.79% 39.277 us 6.21% 417 258.016 KiB
F64 2^22 = 4194304 10 2 20 690x 285.352 us 1.80% 281.374 us 1.09% 14906 64.500 MiB
F64 2^28 = 268435456 10 2 20 886x 15.948 ms 0.47% 15.944 ms 0.47% 16835 4.031 GiB
F64 2^14 = 16384 100 2 20 618x 50.056 us 11.92% 46.079 us 8.03% 355 258.016 KiB
F64 2^22 = 4194304 100 2 20 710x 1.744 ms 0.30% 1.740 ms 0.19% 2411 64.500 MiB
F64 2^28 = 268435456 100 2 20 139x 108.477 ms 0.04% 108.475 ms 0.04% 2474 4.031 GiB

row_variable_rolling_sum

[0] NVIDIA RTX A6000

T num_rows preceding_size following_size Samples CPU Time Noise GPU Time Noise Mrows/s peak_memory_usage
I32 2^14 = 16384 10 2 436x 24.361 us 26.44% 20.407 us 17.95% 802 130.016 KiB
I32 2^22 = 4194304 10 2 714x 149.553 us 5.30% 145.353 us 1.96% 28856 32.500 MiB
I32 2^28 = 268435456 10 2 974x 7.854 ms 0.17% 7.850 ms 0.17% 34193 2.031 GiB
I32 2^14 = 16384 100 2 364x 30.150 us 16.63% 26.130 us 6.01% 627 130.016 KiB
I32 2^22 = 4194304 100 2 628x 238.815 us 3.44% 234.618 us 2.86% 17877 32.500 MiB
I32 2^28 = 268435456 100 2 624x 14.628 ms 0.79% 14.624 ms 0.79% 18355 2.031 GiB
F64 2^14 = 16384 10 2 440x 30.546 us 16.87% 26.256 us 2.86% 624 130.016 KiB
F64 2^22 = 4194304 10 2 574x 227.037 us 2.68% 223.001 us 1.96% 18808 32.500 MiB
F64 2^28 = 268435456 10 2 930x 12.900 ms 0.58% 12.896 ms 0.58% 20814 2.031 GiB
F64 2^14 = 16384 100 2 412x 37.479 us 13.85% 33.438 us 6.71% 489 130.016 KiB
F64 2^22 = 4194304 100 2 792x 1.671 ms 0.55% 1.667 ms 0.49% 2516 32.500 MiB
F64 2^28 = 268435456 100 2 144x 104.675 ms 0.05% 104.673 ms 0.05% 2564 2.031 GiB
After

Benchmark Results

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

T num_rows preceding_size following_size min_periods cardinality Samples CPU Time Noise GPU Time Noise Mrows/s peak_memory_usage
I32 2^14 = 16384 1 2 1 10 610x 116.114 us 4.38% 112.111 us 2.55% 146 322.031 KiB
I32 2^28 = 268435456 1 2 1 10 875x 17.128 ms 0.22% 17.124 ms 0.22% 15676 5.031 GiB
I32 2^14 = 16384 10 2 1 10 1038x 113.661 us 12.55% 109.758 us 12.03% 149 322.031 KiB
I32 2^28 = 268435456 10 2 1 10 874x 17.147 ms 0.19% 17.143 ms 0.19% 15658 5.031 GiB
I32 2^14 = 16384 1 2 1 100 784x 112.206 us 5.86% 108.231 us 4.51% 151 322.031 KiB
I32 2^28 = 268435456 1 2 1 100 874x 17.144 ms 0.19% 17.140 ms 0.19% 15661 5.031 GiB
I32 2^14 = 16384 10 2 1 100 742x 114.504 us 5.52% 110.542 us 4.21% 148 322.031 KiB
I32 2^28 = 268435456 10 2 1 100 872x 17.189 ms 0.19% 17.185 ms 0.19% 15620 5.031 GiB
I32 2^14 = 16384 1 2 1 1000000 660x 113.501 us 5.42% 109.536 us 4.03% 149 322.031 KiB
I32 2^28 = 268435456 1 2 1 1000000 857x 17.490 ms 0.30% 17.486 ms 0.30% 15351 5.031 GiB
I32 2^14 = 16384 10 2 1 1000000 596x 115.135 us 5.20% 111.178 us 3.78% 147 322.031 KiB
I32 2^28 = 268435456 10 2 1 1000000 851x 17.608 ms 0.18% 17.604 ms 0.18% 15248 5.031 GiB
I32 2^14 = 16384 1 2 1 100000000 646x 113.959 us 4.91% 109.990 us 3.34% 148 322.031 KiB
I32 2^28 = 268435456 1 2 1 100000000 734x 20.436 ms 0.19% 20.432 ms 0.18% 13138 5.031 GiB
I32 2^14 = 16384 10 2 1 100000000 418x 116.388 us 4.26% 112.387 us 2.31% 145 322.031 KiB
I32 2^28 = 268435456 10 2 1 100000000 721x 20.796 ms 0.15% 20.792 ms 0.15% 12910 5.031 GiB
F64 2^14 = 16384 1 2 1 10 744x 113.892 us 5.18% 109.941 us 3.75% 149 322.031 KiB
F64 2^28 = 268435456 1 2 1 10 797x 18.798 ms 0.17% 18.794 ms 0.17% 14283 5.031 GiB
F64 2^14 = 16384 10 2 1 10 694x 114.918 us 5.14% 110.968 us 3.69% 147 322.031 KiB
F64 2^28 = 268435456 10 2 1 10 626x 23.972 ms 0.41% 23.968 ms 0.41% 11199 5.031 GiB
F64 2^14 = 16384 1 2 1 100 754x 114.013 us 5.51% 110.094 us 4.22% 148 322.031 KiB
F64 2^28 = 268435456 1 2 1 100 797x 18.809 ms 0.19% 18.805 ms 0.18% 14274 5.031 GiB
F64 2^14 = 16384 10 2 1 100 624x 114.799 us 5.12% 110.850 us 3.64% 147 322.031 KiB
F64 2^28 = 268435456 10 2 1 100 624x 24.035 ms 0.53% 24.031 ms 0.53% 11170 5.031 GiB
F64 2^14 = 16384 1 2 1 1000000 588x 114.699 us 5.00% 110.743 us 3.46% 147 322.031 KiB
F64 2^28 = 268435456 1 2 1 1000000 783x 19.138 ms 0.17% 19.134 ms 0.17% 14029 5.031 GiB
F64 2^14 = 16384 10 2 1 1000000 560x 114.636 us 5.26% 110.662 us 3.82% 148 322.031 KiB
F64 2^28 = 268435456 10 2 1 1000000 606x 24.757 ms 0.45% 24.753 ms 0.45% 10844 5.031 GiB
F64 2^14 = 16384 1 2 1 100000000 780x 113.344 us 5.66% 109.376 us 4.32% 149 322.031 KiB
F64 2^28 = 268435456 1 2 1 100000000 682x 21.974 ms 0.15% 21.970 ms 0.15% 12218 5.031 GiB
F64 2^14 = 16384 10 2 1 100000000 568x 115.521 us 4.91% 111.565 us 3.34% 146 322.031 KiB
F64 2^28 = 268435456 10 2 1 100000000 502x 29.905 ms 0.19% 29.901 ms 0.19% 8977 5.031 GiB

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

T num_rows preceding_size following_size min_periods Samples CPU Time Noise GPU Time Noise Mrows/s peak_memory_usage
I32 2^14 = 16384 1 2 1 474x 21.860 us 26.56% 17.797 us 12.02% 920 130.016 KiB
I32 2^22 = 4194304 1 2 1 728x 100.117 us 5.62% 96.093 us 3.38% 43648 32.500 MiB
I32 2^28 = 268435456 1 2 1 824x 4.772 ms 0.21% 4.768 ms 0.19% 56303 2.031 GiB
I32 2^14 = 16384 10 2 1 424x 21.741 us 26.94% 17.721 us 14.31% 924 130.016 KiB
I32 2^22 = 4194304 10 2 1 660x 99.712 us 5.91% 95.725 us 4.13% 43816 32.500 MiB
I32 2^28 = 268435456 10 2 1 962x 4.763 ms 0.28% 4.759 ms 0.26% 56409 2.031 GiB
I32 2^14 = 16384 100 2 1 592x 30.530 us 19.38% 26.492 us 11.86% 618 130.016 KiB
I32 2^22 = 4194304 100 2 1 558x 238.409 us 2.91% 234.292 us 2.29% 17902 32.500 MiB
I32 2^28 = 268435456 100 2 1 814x 14.617 ms 0.47% 14.612 ms 0.47% 18370 2.031 GiB
I32 2^14 = 16384 1 2 20 474x 22.231 us 24.96% 17.994 us 7.92% 910 130.016 KiB
I32 2^22 = 4194304 1 2 20 776x 99.747 us 4.85% 95.737 us 2.39% 43810 32.500 MiB
I32 2^28 = 268435456 1 2 20 866x 4.771 ms 0.11% 4.767 ms 0.08% 56317 2.031 GiB
I32 2^14 = 16384 10 2 20 404x 21.898 us 25.01% 17.827 us 9.79% 919 130.016 KiB
I32 2^22 = 4194304 10 2 20 692x 99.264 us 4.72% 95.282 us 2.15% 44019 32.500 MiB
I32 2^28 = 268435456 10 2 20 706x 4.762 ms 0.29% 4.758 ms 0.28% 56419 2.031 GiB
I32 2^14 = 16384 100 2 20 584x 30.366 us 15.81% 26.306 us 3.31% 622 130.016 KiB
I32 2^22 = 4194304 100 2 20 540x 237.905 us 2.62% 233.802 us 1.91% 17939 32.500 MiB
I32 2^28 = 268435456 100 2 20 616x 14.618 ms 0.50% 14.614 ms 0.49% 18368 2.031 GiB
F64 2^14 = 16384 1 2 1 414x 23.969 us 28.36% 19.710 us 17.67% 831 130.016 KiB
F64 2^22 = 4194304 1 2 1 716x 125.282 us 4.01% 121.248 us 2.18% 34592 32.500 MiB
F64 2^28 = 268435456 1 2 1 1082x 6.315 ms 0.22% 6.311 ms 0.21% 42537 2.031 GiB
F64 2^14 = 16384 10 2 1 458x 29.235 us 19.23% 25.181 us 10.50% 650 130.016 KiB
F64 2^22 = 4194304 10 2 1 668x 229.256 us 2.39% 225.265 us 1.60% 18619 32.500 MiB
F64 2^28 = 268435456 10 2 1 1145x 13.088 ms 0.94% 13.084 ms 0.94% 20516 2.031 GiB
F64 2^14 = 16384 100 2 1 470x 36.907 us 14.33% 32.953 us 8.02% 497 130.016 KiB
F64 2^22 = 4194304 100 2 1 704x 1.699 ms 1.58% 1.695 ms 1.56% 2473 32.500 MiB
F64 2^28 = 268435456 100 2 1 143x 105.210 ms 0.37% 105.206 ms 0.37% 2551 2.031 GiB
F64 2^14 = 16384 1 2 20 416x 21.304 us 24.50% 17.376 us 9.38% 942 130.016 KiB
F64 2^22 = 4194304 1 2 20 584x 124.586 us 3.85% 120.627 us 1.91% 34770 32.500 MiB
F64 2^28 = 268435456 1 2 20 998x 6.318 ms 0.23% 6.314 ms 0.22% 42516 2.031 GiB
F64 2^14 = 16384 10 2 20 632x 28.582 us 20.04% 24.602 us 11.91% 665 130.016 KiB
F64 2^22 = 4194304 10 2 20 664x 222.036 us 2.17% 218.110 us 1.20% 19230 32.500 MiB
F64 2^28 = 268435456 10 2 20 1000x 12.880 ms 0.76% 12.876 ms 0.76% 20847 2.031 GiB
F64 2^14 = 16384 100 2 20 356x 33.704 us 17.99% 29.671 us 11.54% 552 130.016 KiB
F64 2^22 = 4194304 100 2 20 528x 1.668 ms 0.47% 1.664 ms 0.40% 2520 32.500 MiB
F64 2^28 = 268435456 100 2 20 143x 104.902 ms 0.04% 104.898 ms 0.04% 2559 2.031 GiB

row_variable_rolling_sum

[0] NVIDIA RTX A6000

T num_rows preceding_size following_size Samples CPU Time Noise GPU Time Noise Mrows/s peak_memory_usage
I32 2^14 = 16384 10 2 392x 21.847 us 24.65% 17.969 us 12.83% 911 130.016 KiB
I32 2^22 = 4194304 10 2 786x 149.751 us 3.19% 145.831 us 1.77% 28761 32.500 MiB
I32 2^28 = 268435456 10 2 1186x 7.867 ms 0.21% 7.863 ms 0.20% 34140 2.031 GiB
I32 2^14 = 16384 100 2 406x 29.974 us 16.38% 26.012 us 5.87% 629 130.016 KiB
I32 2^22 = 4194304 100 2 588x 246.351 us 3.57% 242.230 us 3.12% 17315 32.500 MiB
I32 2^28 = 268435456 100 2 612x 15.222 ms 0.75% 15.218 ms 0.75% 17639 2.031 GiB
F64 2^14 = 16384 10 2 410x 30.346 us 16.77% 26.087 us 3.69% 628 130.016 KiB
F64 2^22 = 4194304 10 2 632x 226.719 us 2.91% 222.736 us 2.26% 18830 32.500 MiB
F64 2^28 = 268435456 10 2 1058x 12.882 ms 0.82% 12.877 ms 0.82% 20845 2.031 GiB
F64 2^14 = 16384 100 2 414x 38.118 us 12.55% 34.123 us 4.52% 480 130.016 KiB
F64 2^22 = 4194304 100 2 658x 1.677 ms 0.33% 1.673 ms 0.24% 2507 32.500 MiB
F64 2^28 = 268435456 100 2 143x 104.991 ms 0.04% 104.987 ms 0.04% 2556 2.031 GiB

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner December 18, 2024 17:37
@wence- wence- requested review from shrshi and mhaseeb123 December 18, 2024 17:37
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Dec 18, 2024
@wence- wence- added improvement Improvement / enhancement to an existing function breaking Breaking change DO NOT MERGE Hold off on merging; see PR for details labels Dec 18, 2024
@wence- (Contributor, Author) commented Dec 18, 2024

Sits on top of #17613, so don't merge until that is in and this is rebased/has trunk back-merged.

@wence- (Contributor, Author) left a comment

Signposts

cpp/benchmarks/CMakeLists.txt (outdated, resolved)
cpp/benchmarks/rolling/grouped_rolling_sum.cpp (outdated, resolved)
cpp/benchmarks/rolling/rolling_sum.cpp (outdated, resolved)
cpp/src/rolling/detail/rolling.cuh (resolved)
cpp/src/rolling/detail/rolling.cuh (resolved)
Comment on lines +126 to +133
namespace utils = cudf::detail::rolling;
auto groups = utils::grouped{group_labels.data(), group_offsets.data()};
auto preceding =
  utils::make_clamped_window_iterator<utils::direction::PRECEDING>(preceding_window, groups);
auto following =
  utils::make_clamped_window_iterator<utils::direction::FOLLOWING>(following_window, groups);
return cudf::detail::rolling_window(
  input, default_outputs, preceding, following, min_periods, aggr, stream, mr);

Using the new utilities, so we can delete a bunch of code.

@@ -660,7 +661,7 @@ TEST_F(RollingErrorTest, WindowArraySizeMismatch)
cudf::test::fixed_width_column_wrapper<cudf::size_type> input(
col_data.begin(), col_data.end(), col_valid.begin());

-  std::vector<cudf::size_type> five({2, 1, 2, 1, 4});
+  std::vector<cudf::size_type> five({1, 1, 2, 1, 0});

Needs adapting to conform to the new caller requirement.

Comment on lines -994 to -995
// this is a special test to check the volatile count variable issue (see rolling.cu for detail)
TYPED_TEST(RollingTest, VolatileCount)

All those volatiles are long-gone (cuda 10.1 bug)

Comment on lines +1121 to +1129
auto it = thrust::make_counting_iterator<cudf::size_type>(0);
std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) {
  auto p = window_rng.generate();
  return std::min(i + 1, std::max(p, i + 1 - num_rows));
});
std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) {
  auto f = window_rng.generate();
  return std::max(-i - 1, std::min(f, num_rows - i - 1));
});

Ensuring the randomly generated values conform to interface requirements

Comment on lines +1155 to +1163
auto it = thrust::make_counting_iterator<cudf::size_type>(0);
std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) {
  auto p = window_rng.generate();
  return std::min(i + 1, std::max(p, i + 1 - num_rows));
});
std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) {
  auto f = window_rng.generate();
  return std::max(-i - 1, std::min(f, num_rows - i - 1));
});

And again.

@wence- (Contributor, Author) commented Dec 18, 2024

Note that I haven't touched the jit rolling window kernel yet (some of the same changes will apply there eventually).

@mythrocks This changes the API contract for rolling_window, so spark may need to adapt.

@mythrocks mythrocks self-requested a review December 18, 2024 18:19
@mythrocks (Contributor) commented

> @mythrocks This changes the API contract for rolling_window, so spark may need to adapt.

I'm trying to work out which parts of this affect spark-rapids. Spark calls into cudf::grouped_rolling*() with fixed preceding/following offsets, which might exceed group bounds at the margins.

@wence- wence- requested a review from a team as a code owner December 19, 2024 10:15
@github-actions github-actions bot added the Java Affects Java cuDF API. label Dec 19, 2024
@wence- (Contributor, Author) commented Dec 19, 2024

> > @mythrocks This changes the API contract for rolling_window, so spark may need to adapt.
>
> I'm trying to work out which parts of this affect spark-rapids. Spark calls into cudf::grouped_rolling*() with fixed preceding/following offsets, which might exceed group bounds at the margins.

I think this is a call into

std::unique_ptr<column> grouped_rolling_window(table_view const& group_keys,
                                               column_view const& input,
                                               column_view const& default_outputs,
                                               window_bounds preceding_window_bounds,
                                               window_bounds following_window_bounds,
                                               size_type min_periods,
                                               rolling_aggregation const& aggr,
                                               rmm::cuda_stream_view stream,
                                               rmm::device_async_resource_ref mr)

If so, those fixed offsets are still clamped to the group bounds, via a transform iterator:

auto groups = utils::grouped{group_labels.data(), group_offsets.data()};
auto preceding =
  utils::make_clamped_window_iterator<utils::direction::PRECEDING>(preceding_window, groups);
auto following =
  utils::make_clamped_window_iterator<utils::direction::FOLLOWING>(following_window, groups);
return cudf::detail::rolling_window(
  input, default_outputs, preceding, following, min_periods, aggr, stream, mr);

@mythrocks (Contributor) commented

> If so, those fixed offsets are still clamped to the group bounds, via a transform iterator:

That's what I thought. It seems to me that this really ought to work for spark-rapids without code changes, if the transform iterator is doing the clamping.

I'll have a go at integrating this with spark-rapids and running some tests.

@mythrocks (Contributor) commented

I'd like to double-check my work, but it appears that the window-function tests in spark-rapids are running correctly with this refactored version.

I'll have another go at the tests to confirm.

@mythrocks (Contributor) left a comment

This is looking pretty good to me. I'm going to play with this some more, and try to finalize the review tomorrow.

cpp/include/cudf/rolling.hpp (resolved)
cpp/src/rolling/detail/rolling.cuh (resolved)
cpp/src/rolling/detail/rolling_utils.cuh (resolved)
struct fixed_window_clamper {
  Grouping groups;
  cudf::size_type delta;
  [[nodiscard]] __device__ cudf::size_type operator()(cudf::size_type i)

Do you suppose we might get away with making this a const function?

Suggested change
-  [[nodiscard]] __device__ cudf::size_type operator()(cudf::size_type i)
+  [[nodiscard]] __device__ cudf::size_type operator()(cudf::size_type i) const

@wence- (Contributor, Author) commented Jan 7, 2025

Oh yes, thanks. I guess it can be noexcept as well.


Ah, it can't be noexcept because min/max are not noexcept

cpp/src/rolling/detail/rolling_utils.cuh (resolved)

@mythrocks (Contributor) left a comment

LGTM! I've verified that this change does not intrude on the spark-rapids functionality. The PR keeps the API and behaviour intact, at least from spark-rapids's perspective.

@wence- wence- force-pushed the wence/fea/rolling-perf branch from 030ae7a to 7e9922b Compare January 8, 2025 11:32
@wence- wence- requested a review from a team as a code owner January 8, 2025 11:32
wence- added 6 commits January 8, 2025 14:54
This way, the responsibility of producing an iterator that is in
bounds falls on the caller, and we don't have to do extra work in
cases where the input is guaranteed to be in bounds already.

When passing column_view objects for preceding and following windows,
the caller is responsible for ensuring that any resulting indexing is
in-bounds.

The caller must provide in-bounds columns for the preceding
and following windows.

Now that the requirement for producing in-bounds data has moved to the
caller, we must adapt the tests to produce the correct inputs.
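As an illustration of the contract these commit messages describe, the following host-only sketch checks whether per-row preceding and following values are in bounds. `windows_in_bounds` is a hypothetical helper and not part of libcudf; it assumes an ungrouped column and uses `std::vector<std::int32_t>` in place of `column_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a libcudf API): returns true when every row's
// window satisfies 0 <= start <= end <= num_rows, where
//   start = i - preceding[i] + 1 and end = i + following[i] + 1.
inline bool windows_in_bounds(std::vector<std::int32_t> const& preceding,
                              std::vector<std::int32_t> const& following)
{
  assert(preceding.size() == following.size());
  auto const num_rows = static_cast<std::int64_t>(preceding.size());
  for (std::int64_t i = 0; i < num_rows; ++i) {
    auto const idx = static_cast<std::size_t>(i);
    // Widen to 64 bits so that the check itself cannot overflow.
    std::int64_t const start = i - preceding[idx] + 1;
    std::int64_t const end   = i + following[idx] + 1;
    if (start < 0 || end > num_rows || start > end) { return false; }
  }
  return true;
}
```

A caller or test could run a check like this on its inputs before handing them to the rolling API, since after this change out-of-bounds values are no longer clamped internally.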
@wence- wence- force-pushed the wence/fea/rolling-perf branch from 2df0426 to 3b3b15f Compare January 8, 2025 14:56
@github-actions github-actions bot removed the CMake CMake build issue label Jan 8, 2025
@wence- wence- removed the DO NOT MERGE Hold off on merging; see PR for details label Jan 8, 2025
@wence- wence- removed the request for review from a team January 8, 2025 16:23
Comment on lines +1040 to +1049
+ // The caller is required to provide window bounds that will
+ // result in indexing that is in-bounds for the column. Therefore all
+ // of these calculations cannot overflow and we can do everything
+ // in size_type arithmetic. Moreover, we require that start <=
+ // end, i.e., the window is never "reversed" though it may be empty.
+ auto const preceding_window = preceding_window_begin[i];
+ auto const following_window = following_window_begin[i];

- // compute bounds
- auto const start = static_cast<size_type>(
-   min(static_cast<int64_t>(input.size()), max(int64_t{0}, i - preceding_window + 1)));
- auto const end = static_cast<size_type>(
-   min(static_cast<int64_t>(input.size()), max(int64_t{0}, i + following_window + 1)));
- auto const start_index = min(start, end);
- auto const end_index = max(start, end);
+ size_type const start = i - preceding_window + 1;
+ size_type const end = i + following_window + 1;
Contributor Author

Please check my working for integer overflow here. preceding_window and following_window both have type size_type (as does i). So this is all happening in size_type arithmetic.

Since i is the row index, it is contained in [0, ..., num_rows). So i + 1 is always safe. So we just have to make sure that subtracting a negative preceding window never overflows, and adding a positive following window never overflows. This is handled by the clamping kernel.

So if we inline that logic, this looks like:

size_type start = i - cuda::std::min(i + 1, cuda::std::max(delta, i + 1 - num_rows)) + 1;
size_type end = i + cuda::std::max(-i - 1, cuda::std::min(delta, num_rows - i - 1)) + 1;

I think none of these operations overflow if num_rows == INT_MAX but please check.
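One way to check this working is to evaluate the same two expressions in 64-bit arithmetic and record whether any intermediate result leaves the 32-bit range. The sketch below does that on the host. It assumes `size_type` is a 32-bit signed integer, substitutes `std::min`/`std::max` for the `cuda::std` versions, and uses hypothetical names (`window_bounds`, `checked_bounds`, `all_safe`); the preceding and following deltas are passed separately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

using size_type = std::int32_t;  // assumption: mirrors cudf::size_type

struct window_bounds {
  std::int64_t start;
  std::int64_t end;
  bool no_overflow;  // true if every intermediate fits in size_type
};

// Evaluates the two inlined expressions in 64-bit arithmetic, recording
// after each operation whether a 32-bit evaluation would have overflowed.
inline window_bounds checked_bounds(size_type i,
                                    size_type preceding_delta,
                                    size_type following_delta,
                                    size_type num_rows)
{
  bool ok   = true;
  auto fits = [&ok](std::int64_t v) {
    ok = ok && v >= std::numeric_limits<size_type>::min() &&
         v <= std::numeric_limits<size_type>::max();
    return v;
  };
  std::int64_t const I = i, P = preceding_delta, F = following_delta, N = num_rows;
  // Nesting of fits() follows left-to-right evaluation of the originals.
  auto const preceding = std::min(fits(I + 1), std::max(P, fits(fits(I + 1) - N)));
  auto const following = std::max(fits(fits(-I) - 1), std::min(F, fits(fits(N - I) - 1)));
  std::int64_t const start = fits(fits(I - preceding) + 1);
  std::int64_t const end   = fits(fits(I + following) + 1);
  return {start, end, ok};
}

// Sweeps first, middle, and last rows against extreme deltas and reports
// whether all results are overflow-free and lie within [0, num_rows].
inline bool all_safe(size_type num_rows)
{
  assert(num_rows >= 1);
  constexpr size_type lo = std::numeric_limits<size_type>::min();
  constexpr size_type hi = std::numeric_limits<size_type>::max();
  size_type const rows[]   = {0, num_rows / 2, num_rows - 1};
  size_type const deltas[] = {lo, -1, 0, 1, hi};
  for (auto i : rows) {
    for (auto p : deltas) {
      for (auto f : deltas) {
        auto const b = checked_bounds(i, p, f, num_rows);
        if (!b.no_overflow || b.start < 0 || b.start > num_rows || b.end < 0 ||
            b.end > num_rows) {
          return false;
        }
      }
    }
  }
  return true;
}
```

Calling `all_safe(std::numeric_limits<size_type>::max())` exercises the `num_rows == INT_MAX` case raised above. It samples edge cases rather than proving the property for every row and delta.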

Member

@mhaseeb123 mhaseeb123 left a comment

Not too familiar with the design of rolling windows in libcudf per se, but the changes mostly make sense to me. Would recommend feedback from another, more familiar reviewer if possible.

Contributor

@shrshi shrshi left a comment

Looks good to me overall. A couple of questions -

cpp/src/rolling/detail/lead_lag_nested.cuh
cpp/src/rolling/detail/rolling_utils.cuh
Contributor Author

wence- commented Jan 20, 2025

/merge

cpp/src/rolling/detail/rolling_utils.cuh (outdated)
cpp/src/rolling/detail/rolling_utils.cuh (outdated)
@rapids-bot rapids-bot bot merged commit bcda85f into rapidsai:branch-25.02 Jan 23, 2025
106 checks passed
@wence- wence- deleted the wence/fea/rolling-perf branch January 23, 2025 12:48
Labels
breaking (Breaking change), improvement (Improvement / enhancement to an existing function), Java (Affects Java cuDF API), libcudf (Affects libcudf (C++/CUDA) code)
Projects
Status: Landed
5 participants