
Added polynomials benchmark #17695

Open
wants to merge 10 commits into
base: branch-25.02

Conversation

lamarrr
Contributor

@lamarrr lamarrr commented Jan 8, 2025

Description

This pull request adds benchmarks that compare the AST, UDF transform, and BINARY_OP approaches by evaluating a polynomial.

Closes #17561
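
For orientation, here is a minimal host-side sketch of the computation being compared. It is not the benchmark code: it assumes the polynomial is evaluated in Horner form with the leading coefficient first (as the expression-building loop quoted later in the review suggests), and the names `horner_fused` and `horner_stepwise` are hypothetical. The fused variant applies the whole recurrence per element in one pass; the stepwise variant makes a separate full pass for every multiply and add, which only loosely mirrors an approach that issues one column-wide operation per arithmetic step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused evaluation: a single pass over x, applying the full Horner
// recurrence to each element. coeffs[0] is the leading coefficient.
template <typename T>
std::vector<T> horner_fused(std::vector<T> const& x, std::vector<T> const& coeffs)
{
  assert(!coeffs.empty());
  std::vector<T> y(x.size());
  for (std::size_t row = 0; row < x.size(); ++row) {
    T acc = coeffs[0];
    for (std::size_t i = 1; i < coeffs.size(); ++i) {
      acc = acc * x[row] + coeffs[i];
    }
    y[row] = acc;
  }
  return y;
}

// Stepwise evaluation: every multiply and every add is its own full pass
// over the column, so the data is traversed 2 * order times.
template <typename T>
std::vector<T> horner_stepwise(std::vector<T> const& x, std::vector<T> const& coeffs)
{
  assert(!coeffs.empty());
  std::vector<T> y(x.size(), coeffs[0]);
  for (std::size_t i = 1; i < coeffs.size(); ++i) {
    for (std::size_t row = 0; row < x.size(); ++row) { y[row] = y[row] * x[row]; }     // multiply pass
    for (std::size_t row = 0; row < x.size(); ++row) { y[row] = y[row] + coeffs[i]; }  // add pass
  }
  return y;
}
```

Both functions return the same values; they differ only in how many times the column is traversed.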

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Jan 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot added the libcudf and CMake labels, Jan 8, 2025
lamarrr added the feature request and non-breaking labels and removed the libcudf and CMake labels, Jan 8, 2025
github-actions bot added the libcudf and CMake labels, Jan 8, 2025
lamarrr marked this pull request as ready for review, January 8, 2025 15:43
lamarrr requested review from a team as code owners, January 8, 2025 15:43
lamarrr requested review from vyasr and mhaseeb123, January 8, 2025 15:43
Contributor Author

lamarrr commented Jan 8, 2025

Benchmark Results

ast_polynomials_float32

[0] NVIDIA RTX A6000

| num_rows | order | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|
| 100000 | 1 | 8624x | 62.750 us | 97.80% | 57.989 us | 69.68% | 13.796 GB/s | 1.80% |
| 1000000 | 1 | 3600x | 145.025 us | 48.47% | 139.172 us | 35.04% | 57.483 GB/s | 7.48% |
| 10000000 | 1 | 2960x | 925.562 us | 12.18% | 916.996 us | 10.13% | 87.241 GB/s | 11.36% |
| 100000000 | 1 | 1731x | 8.654 ms | 3.03% | 8.648 ms | 2.90% | 92.505 GB/s | 12.04% |
| 100000 | 2 | 8304x | 64.928 us | 74.94% | 60.233 us | 55.71% | 13.282 GB/s | 1.73% |
| 1000000 | 2 | 3456x | 161.172 us | 50.53% | 156.207 us | 45.04% | 51.214 GB/s | 6.67% |
| 10000000 | 2 | 1104x | 1.118 ms | 13.91% | 1.112 ms | 11.82% | 71.973 GB/s | 9.37% |
| 100000000 | 2 | 1427x | 10.504 ms | 2.89% | 10.498 ms | 2.81% | 76.203 GB/s | 9.92% |
| 100000 | 4 | 7344x | 72.841 us | 94.53% | 68.158 us | 74.67% | 11.738 GB/s | 1.53% |
| 1000000 | 4 | 2512x | 204.972 us | 44.27% | 199.956 us | 38.54% | 40.009 GB/s | 5.21% |
| 10000000 | 4 | 3024x | 1.480 ms | 7.95% | 1.475 ms | 7.38% | 54.238 GB/s | 7.06% |
| 100000000 | 4 | 1048x | 14.307 ms | 2.81% | 14.302 ms | 2.78% | 55.934 GB/s | 7.28% |
| 100000 | 8 | 6112x | 86.564 us | 81.06% | 81.857 us | 72.73% | 9.773 GB/s | 1.27% |
| 1000000 | 8 | 1952x | 290.651 us | 27.06% | 284.928 us | 21.48% | 28.077 GB/s | 3.66% |
| 10000000 | 8 | 2640x | 2.318 ms | 6.09% | 2.313 ms | 5.98% | 34.581 GB/s | 4.50% |
| 100000000 | 8 | 659x | 22.770 ms | 2.38% | 22.763 ms | 2.34% | 35.145 GB/s | 4.58% |
| 100000 | 16 | 4752x | 110.209 us | 53.98% | 105.354 us | 36.32% | 7.593 GB/s | 0.99% |
| 1000000 | 16 | 1536x | 485.190 us | 19.22% | 478.043 us | 13.43% | 16.735 GB/s | 2.18% |
| 10000000 | 16 | 2432x | 4.111 ms | 3.75% | 4.106 ms | 3.45% | 19.486 GB/s | 2.54% |
| 100000000 | 16 | 370x | 40.624 ms | 2.18% | 40.619 ms | 2.17% | 19.695 GB/s | 2.56% |
| 100000 | 32 | 3344x | 170.579 us | 53.09% | 165.085 us | 36.15% | 4.846 GB/s | 0.63% |
| 1000000 | 32 | 880x | 878.193 us | 14.39% | 870.225 us | 10.87% | 9.193 GB/s | 1.20% |
| 10000000 | 32 | 1923x | 7.785 ms | 2.94% | 7.781 ms | 2.94% | 10.281 GB/s | 1.34% |
| 100000000 | 32 | 195x | 77.241 ms | 2.21% | 77.237 ms | 2.21% | 10.358 GB/s | 1.35% |

ast_polynomials_float64

[0] NVIDIA RTX A6000

| num_rows | order | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|
| 100000 | 1 | 8944x | 60.571 us | 94.96% | 55.977 us | 80.33% | 28.583 GB/s | 3.72% |
| 1000000 | 1 | 3760x | 141.851 us | 39.63% | 136.533 us | 30.25% | 117.188 GB/s | 15.26% |
| 10000000 | 1 | 1072x | 979.935 us | 15.78% | 972.653 us | 12.24% | 164.498 GB/s | 21.42% |
| 100000000 | 1 | 1627x | 9.210 ms | 2.63% | 9.205 ms | 2.61% | 173.823 GB/s | 22.63% |
| 100000 | 2 | 8304x | 64.632 us | 96.61% | 60.311 us | 91.04% | 26.529 GB/s | 3.45% |
| 1000000 | 2 | 3008x | 171.172 us | 45.64% | 166.443 us | 40.72% | 96.129 GB/s | 12.52% |
| 10000000 | 2 | 976x | 1.180 ms | 14.23% | 1.172 ms | 11.63% | 136.504 GB/s | 17.77% |
| 100000000 | 2 | 1349x | 11.109 ms | 2.69% | 11.104 ms | 2.65% | 144.093 GB/s | 18.76% |
| 100000 | 4 | 7264x | 73.396 us | 81.20% | 68.850 us | 73.87% | 23.239 GB/s | 3.03% |
| 1000000 | 4 | 2480x | 208.299 us | 39.39% | 202.918 us | 29.89% | 78.850 GB/s | 10.27% |
| 10000000 | 4 | 2976x | 1.568 ms | 7.81% | 1.563 ms | 6.62% | 102.397 GB/s | 13.33% |
| 100000000 | 4 | 989x | 15.163 ms | 2.63% | 15.157 ms | 2.51% | 105.559 GB/s | 13.74% |
| 100000 | 8 | 5952x | 88.764 us | 78.95% | 84.110 us | 68.15% | 19.023 GB/s | 2.48% |
| 1000000 | 8 | 1808x | 303.677 us | 28.42% | 297.680 us | 20.97% | 53.749 GB/s | 7.00% |
| 10000000 | 8 | 2720x | 2.470 ms | 5.70% | 2.466 ms | 5.43% | 64.890 GB/s | 8.45% |
| 100000000 | 8 | 621x | 24.179 ms | 2.41% | 24.174 ms | 2.41% | 66.186 GB/s | 8.62% |
| 100000 | 16 | 4496x | 116.183 us | 65.36% | 111.280 us | 49.33% | 14.378 GB/s | 1.87% |
| 1000000 | 16 | 1600x | 516.242 us | 23.96% | 509.972 us | 18.36% | 31.374 GB/s | 4.08% |
| 10000000 | 16 | 2496x | 4.384 ms | 4.08% | 4.379 ms | 4.06% | 36.538 GB/s | 4.76% |
| 100000000 | 16 | 346x | 43.341 ms | 2.22% | 43.337 ms | 2.22% | 36.920 GB/s | 4.81% |
| 100000 | 32 | 3152x | 177.608 us | 37.78% | 172.243 us | 28.03% | 9.289 GB/s | 1.21% |
| 1000000 | 32 | 1376x | 939.146 us | 15.96% | 930.330 us | 12.21% | 17.198 GB/s | 2.24% |
| 10000000 | 32 | 1798x | 8.328 ms | 3.29% | 8.323 ms | 3.20% | 19.224 GB/s | 2.50% |
| 100000000 | 32 | 183x | 82.325 ms | 1.91% | 82.314 ms | 1.87% | 19.438 GB/s | 2.53% |

binaryop_polynomials_float32

[0] NVIDIA RTX A6000

| num_rows | order | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|
| 100000 | 1 | 4704x | 117.778 us | 63.79% | 112.836 us | 54.66% | 7.090 GB/s | 0.92% |
| 1000000 | 1 | 3376x | 154.063 us | 67.58% | 148.416 us | 54.34% | 53.903 GB/s | 7.02% |
| 10000000 | 1 | 1056x | 485.723 us | 23.02% | 479.950 us | 21.75% | 166.684 GB/s | 21.70% |
| 100000000 | 1 | 2704x | 3.609 ms | 4.67% | 3.604 ms | 4.56% | 221.954 GB/s | 28.90% |
| 100000 | 2 | 2848x | 208.347 us | 57.17% | 203.849 us | 54.97% | 3.924 GB/s | 0.51% |
| 1000000 | 2 | 2368x | 270.024 us | 41.59% | 264.843 us | 36.10% | 30.207 GB/s | 3.93% |
| 10000000 | 2 | 2944x | 853.241 us | 18.28% | 848.724 us | 17.97% | 94.259 GB/s | 12.27% |
| 100000000 | 2 | 2254x | 6.642 ms | 2.99% | 6.638 ms | 2.92% | 120.525 GB/s | 15.69% |
| 100000 | 4 | 1424x | 381.545 us | 42.62% | 377.103 us | 42.29% | 2.121 GB/s | 0.28% |
| 1000000 | 4 | 1248x | 504.287 us | 38.98% | 500.249 us | 38.98% | 15.992 GB/s | 2.08% |
| 10000000 | 4 | 2544x | 1.600 ms | 10.54% | 1.595 ms | 10.21% | 50.143 GB/s | 6.53% |
| 100000000 | 4 | 1179x | 12.718 ms | 2.06% | 12.713 ms | 2.01% | 62.930 GB/s | 8.19% |
| 100000 | 8 | 2592x | 730.357 us | 25.04% | 726.121 us | 25.01% | 1.102 GB/s | 0.14% |
| 1000000 | 8 | 2848x | 975.008 us | 17.37% | 970.472 us | 17.29% | 8.243 GB/s | 1.07% |
| 10000000 | 8 | 2592x | 3.089 ms | 6.60% | 3.085 ms | 6.59% | 25.936 GB/s | 3.38% |
| 100000000 | 8 | 603x | 24.882 ms | 1.78% | 24.877 ms | 1.78% | 32.159 GB/s | 4.19% |
| 100000 | 16 | 2896x | 1.396 ms | 13.28% | 1.392 ms | 13.26% | 574.836 MB/s | 0.07% |
| 1000000 | 16 | 2656x | 1.891 ms | 10.16% | 1.887 ms | 10.12% | 4.240 GB/s | 0.55% |
| 10000000 | 16 | 2288x | 6.072 ms | 4.02% | 6.068 ms | 4.02% | 13.184 GB/s | 1.72% |
| 100000000 | 16 | 305x | 49.191 ms | 1.49% | 49.187 ms | 1.49% | 16.264 GB/s | 2.12% |
| 100000 | 32 | 2560x | 2.739 ms | 7.19% | 2.735 ms | 7.19% | 292.509 MB/s | 0.04% |
| 1000000 | 32 | 2512x | 3.728 ms | 5.67% | 3.724 ms | 5.65% | 2.148 GB/s | 0.28% |
| 10000000 | 32 | 1246x | 12.032 ms | 2.86% | 12.028 ms | 2.86% | 6.651 GB/s | 0.87% |
| 100000000 | 32 | 154x | 97.796 ms | 1.32% | 97.790 ms | 1.31% | 8.181 GB/s | 1.07% |

binaryop_polynomials_float64

[0] NVIDIA RTX A6000

| num_rows | order | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|
| 100000 | 1 | 4272x | 123.582 us | 71.33% | 118.359 us | 65.54% | 13.518 GB/s | 1.76% |
| 1000000 | 1 | 2880x | 186.099 us | 60.54% | 181.248 us | 51.94% | 88.277 GB/s | 11.49% |
| 10000000 | 1 | 1312x | 817.490 us | 21.60% | 810.321 us | 17.82% | 197.453 GB/s | 25.71% |
| 100000000 | 1 | 2166x | 6.915 ms | 3.32% | 6.911 ms | 3.26% | 231.530 GB/s | 30.14% |
| 100000 | 2 | 2784x | 210.072 us | 57.55% | 205.073 us | 54.61% | 7.802 GB/s | 1.02% |
| 1000000 | 2 | 2048x | 336.892 us | 40.99% | 332.070 us | 38.37% | 48.183 GB/s | 6.27% |
| 10000000 | 2 | 2720x | 1.458 ms | 10.21% | 1.453 ms | 10.06% | 110.096 GB/s | 14.33% |
| 100000000 | 2 | 1176x | 12.749 ms | 2.92% | 12.745 ms | 2.90% | 125.539 GB/s | 16.34% |
| 100000 | 4 | 1856x | 386.265 us | 43.40% | 381.759 us | 42.57% | 4.191 GB/s | 0.55% |
| 1000000 | 4 | 1312x | 637.687 us | 30.31% | 633.415 us | 30.29% | 25.260 GB/s | 3.29% |
| 10000000 | 4 | 2832x | 2.759 ms | 7.08% | 2.755 ms | 7.07% | 58.081 GB/s | 7.56% |
| 100000000 | 4 | 616x | 24.359 ms | 2.00% | 24.353 ms | 1.97% | 65.700 GB/s | 8.55% |
| 100000 | 8 | 1472x | 744.581 us | 28.61% | 740.213 us | 28.52% | 2.162 GB/s | 0.28% |
| 1000000 | 8 | 2784x | 1.204 ms | 16.47% | 1.200 ms | 16.44% | 13.338 GB/s | 1.74% |
| 10000000 | 8 | 2528x | 5.351 ms | 3.83% | 5.347 ms | 3.83% | 29.926 GB/s | 3.90% |
| 100000000 | 8 | 315x | 47.648 ms | 1.92% | 47.644 ms | 1.92% | 33.583 GB/s | 4.37% |
| 100000 | 16 | 2992x | 1.408 ms | 12.50% | 1.403 ms | 12.43% | 1.140 GB/s | 0.15% |
| 1000000 | 16 | 2800x | 2.344 ms | 8.19% | 2.340 ms | 8.17% | 6.838 GB/s | 0.89% |
| 10000000 | 16 | 1417x | 10.578 ms | 2.84% | 10.574 ms | 2.84% | 15.131 GB/s | 1.97% |
| 100000000 | 16 | 160x | 94.177 ms | 1.66% | 94.173 ms | 1.66% | 16.990 GB/s | 2.21% |
| 100000 | 32 | 2608x | 2.767 ms | 7.69% | 2.763 ms | 7.68% | 579.118 MB/s | 0.08% |
| 1000000 | 32 | 2368x | 4.618 ms | 4.99% | 4.614 ms | 4.98% | 3.468 GB/s | 0.45% |
| 10000000 | 32 | 715x | 20.984 ms | 2.39% | 20.980 ms | 2.39% | 7.626 GB/s | 0.99% |
| 100000000 | 32 | 81x | 187.240 ms | 1.59% | 187.236 ms | 1.59% | 8.545 GB/s | 1.11% |

transform_polynomials_float32

[0] NVIDIA RTX A6000

| num_rows | order | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|
| 100000 | 1 | 32416x | 19.348 us | 77.40% | 15.430 us | 48.66% | 51.846 GB/s | 6.75% |
| 1000000 | 1 | 19584x | 30.328 us | 74.07% | 25.533 us | 40.52% | 313.325 GB/s | 40.79% |
| 10000000 | 1 | 3888x | 133.358 us | 13.62% | 128.948 us | 4.07% | 620.406 GB/s | 80.77% |
| 100000000 | 1 | 528x | 1.188 ms | 2.73% | 1.182 ms | 1.85% | 676.966 GB/s | 88.14% |
| 100000 | 2 | 32608x | 19.132 us | 45.59% | 15.340 us | 21.96% | 52.152 GB/s | 6.79% |
| 1000000 | 2 | 17520x | 32.353 us | 22.69% | 28.549 us | 10.46% | 280.224 GB/s | 36.48% |
| 10000000 | 2 | 3136x | 163.963 us | 5.83% | 159.835 us | 1.16% | 500.515 GB/s | 65.16% |
| 100000000 | 2 | 336x | 1.498 ms | 1.28% | 1.493 ms | 0.34% | 536.010 GB/s | 69.78% |
| 100000 | 4 | 30048x | 20.378 us | 35.91% | 16.646 us | 17.64% | 48.061 GB/s | 6.26% |
| 1000000 | 4 | 13584x | 40.727 us | 19.51% | 36.850 us | 10.48% | 217.098 GB/s | 28.26% |
| 10000000 | 4 | 2496x | 239.779 us | 3.33% | 235.780 us | 0.85% | 339.300 GB/s | 44.17% |
| 100000000 | 4 | 1680x | 2.316 ms | 8.20% | 2.302 ms | 6.44% | 347.575 GB/s | 45.25% |
| 100000 | 8 | 25024x | 24.476 us | 108.07% | 19.990 us | 70.52% | 40.020 GB/s | 5.21% |
| 1000000 | 8 | 9632x | 60.319 us | 85.39% | 53.864 us | 36.58% | 148.523 GB/s | 19.34% |
| 10000000 | 8 | 1632x | 402.611 us | 17.12% | 390.418 us | 3.81% | 204.909 GB/s | 26.68% |
| 100000000 | 8 | 2768x | 3.786 ms | 2.58% | 3.778 ms | 2.07% | 211.727 GB/s | 27.57% |
| 100000 | 16 | 19968x | 29.711 us | 84.60% | 25.059 us | 34.05% | 31.924 GB/s | 4.16% |
| 1000000 | 16 | 6064x | 91.561 us | 37.72% | 86.004 us | 6.89% | 93.019 GB/s | 12.11% |
| 10000000 | 16 | 1440x | 710.546 us | 10.09% | 696.614 us | 1.55% | 114.841 GB/s | 14.95% |
| 100000000 | 16 | 2184x | 6.849 ms | 2.65% | 6.843 ms | 2.33% | 116.909 GB/s | 15.22% |
| 100000 | 32 | 14016x | 40.378 us | 70.76% | 35.677 us | 37.83% | 22.423 GB/s | 2.92% |
| 1000000 | 32 | 3312x | 157.167 us | 27.50% | 151.491 us | 12.17% | 52.808 GB/s | 6.88% |
| 10000000 | 32 | 384x | 1.317 ms | 5.87% | 1.305 ms | 0.15% | 61.308 GB/s | 7.98% |
| 100000000 | 32 | 1150x | 13.029 ms | 2.39% | 13.024 ms | 2.34% | 61.425 GB/s | 8.00% |

transform_polynomials_float64

[0] NVIDIA RTX A6000

| num_rows | order | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|
| 100000 | 1 | 30976x | 20.645 us | 124.18% | 16.150 us | 58.23% | 99.071 GB/s | 12.90% |
| 1000000 | 1 | 13648x | 41.642 us | 71.30% | 36.666 us | 31.21% | 436.374 GB/s | 56.81% |
| 10000000 | 1 | 2080x | 248.169 us | 19.42% | 241.596 us | 3.43% | 662.264 GB/s | 86.22% |
| 100000000 | 1 | 576x | 2.431 ms | 9.28% | 2.419 ms | 8.10% | 661.512 GB/s | 86.12% |
| 100000 | 2 | 30704x | 20.626 us | 129.04% | 16.291 us | 87.96% | 98.215 GB/s | 12.79% |
| 1000000 | 2 | 13584x | 41.589 us | 62.09% | 36.833 us | 21.61% | 434.396 GB/s | 56.55% |
| 10000000 | 2 | 2048x | 251.900 us | 19.57% | 244.407 us | 3.94% | 654.646 GB/s | 85.23% |
| 100000000 | 2 | 576x | 2.442 ms | 9.85% | 2.426 ms | 8.08% | 659.452 GB/s | 85.86% |
| 100000 | 4 | 30064x | 21.146 us | 113.60% | 16.633 us | 47.77% | 96.196 GB/s | 12.52% |
| 1000000 | 4 | 13264x | 42.455 us | 71.25% | 37.719 us | 44.02% | 424.193 GB/s | 55.23% |
| 10000000 | 4 | 2432x | 254.104 us | 17.31% | 247.229 us | 4.66% | 647.174 GB/s | 84.26% |
| 100000000 | 4 | 2352x | 2.379 ms | 5.69% | 2.372 ms | 4.79% | 674.601 GB/s | 87.83% |
| 100000 | 8 | 25952x | 23.795 us | 114.08% | 19.273 us | 58.79% | 83.016 GB/s | 10.81% |
| 1000000 | 8 | 10720x | 51.508 us | 51.45% | 46.669 us | 18.68% | 342.837 GB/s | 44.63% |
| 10000000 | 8 | 1568x | 331.378 us | 15.90% | 324.302 us | 4.15% | 493.367 GB/s | 64.23% |
| 100000000 | 8 | 2384x | 3.136 ms | 3.51% | 3.129 ms | 2.62% | 511.328 GB/s | 66.57% |
| 100000 | 16 | 20304x | 29.124 us | 86.04% | 24.634 us | 53.54% | 64.950 GB/s | 8.46% |
| 1000000 | 16 | 6544x | 85.356 us | 42.17% | 79.695 us | 8.80% | 200.766 GB/s | 26.14% |
| 10000000 | 16 | 2928x | 639.720 us | 7.40% | 630.674 us | 1.14% | 253.697 GB/s | 33.03% |
| 100000000 | 16 | 2415x | 6.189 ms | 2.67% | 6.183 ms | 2.47% | 258.773 GB/s | 33.69% |
| 100000 | 32 | 14128x | 40.027 us | 62.44% | 35.421 us | 23.68% | 45.171 GB/s | 5.88% |
| 1000000 | 32 | 3584x | 152.144 us | 30.17% | 145.790 us | 14.37% | 109.747 GB/s | 14.29% |
| 10000000 | 32 | 402x | 1.263 ms | 7.83% | 1.245 ms | 0.31% | 128.471 GB/s | 16.73% |
| 100000000 | 32 | 1217x | 12.311 ms | 2.23% | 12.304 ms | 2.03% | 130.038 GB/s | 16.93% |
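
The GlobalMem BW and BWUtil columns can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that are inferred from the numbers rather than stated in the PR: each benchmark counts one input column read plus one output column written, and BWUtil is measured against a peak of 768 GB/s (the ratio of the two columns is consistent with that value). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: bytes moved = one input column read + one output column written.
double global_mem_bw_gbps(std::size_t num_rows, std::size_t element_bytes, double gpu_time_ms)
{
  assert(gpu_time_ms > 0.0);
  double const bytes = 2.0 * static_cast<double>(num_rows) * static_cast<double>(element_bytes);
  return bytes / (gpu_time_ms * 1e-3) / 1e9;  // bytes per second, expressed in GB/s
}

// Assumption: 768 GB/s peak, inferred from the BW / BWUtil ratio in the tables.
double bw_util_percent(double bw_gbps, double peak_gbps = 768.0)
{
  return 100.0 * bw_gbps / peak_gbps;
}

// Ratio of two GPU times, e.g. AST time divided by transform time.
double speedup(double baseline_ms, double candidate_ms)
{
  assert(candidate_ms > 0.0);
  return baseline_ms / candidate_ms;
}
```

For example, `global_mem_bw_gbps(100000000, 4, 8.648)` comes out at roughly 92.5 GB/s, in line with the ast_polynomials_float32 row for 100000000 rows at order 1, and `speedup(77.237, 13.024)` puts the transform benchmark at a little under 6x the AST benchmark for float32 at 100000000 rows and order 32.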

[Plot: order 1]

[Plot: order 2]

[Plot: order 4]

[Plot: order 8]

[Plot: order 16]

[Plot: order 32]

mhaseeb123 (Member) left a comment

Looks mostly good to me with only a couple suggestions.

Comment on lines 46 to 51
std::vector<cudf::numeric_scalar<key_type>> constants;

std::transform(thrust::make_counting_iterator(0),
               thrust::make_counting_iterator(order + 1),
               std::back_inserter(constants),
               [](int) { return cudf::numeric_scalar<key_type>(1); });

Member

Suggested change
- std::vector<cudf::numeric_scalar<key_type>> constants;
- std::transform(thrust::make_counting_iterator(0),
-                thrust::make_counting_iterator(order + 1),
-                std::back_inserter(constants),
-                [](int) { return cudf::numeric_scalar<key_type>(1); });
+ std::vector<cudf::numeric_scalar<key_type>> constants(order + 1);
+ std::fill(constants.begin(),
+           constants.end(),
+           cudf::numeric_scalar<key_type>(1));

Contributor Author

It's temporary for now; we intend to use a random constant generator to fill the range.
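
As a rough illustration of what a random constant generator could look like, here is a host-only sketch using the standard library. `make_random_constants` is a hypothetical helper, not something from this PR or from the existing benchmark utilities, and the seed and value range are arbitrary assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of the PR): returns order + 1 coefficients
// drawn uniformly from the range lo..hi. A fixed seed keeps runs reproducible.
template <typename T>
std::vector<T> make_random_constants(std::size_t order, unsigned seed, T lo = T{-1}, T hi = T{1})
{
  assert(lo < hi);
  std::mt19937 engine(seed);
  std::uniform_real_distribution<T> dist(lo, hi);
  std::vector<T> constants(order + 1);
  for (auto& c : constants) { c = dist(engine); }
  return constants;
}
```

A fixed seed matters for benchmarking: it keeps the coefficients identical across runs, so timing differences are not caused by different inputs.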

Member

Ah I see, makes sense. Thanks for confirming!

Contributor

Is it possible for us to implement the full solution in this PR? Otherwise, we need a TODO to track this "temporary" solution. I think this will be forgotten otherwise.

Contributor Author

Will do.

cpp/benchmarks/ast/polynomials.cpp (resolved)
cpp/benchmarks/binaryop/polynomials.cpp (outdated, resolved)
cpp/benchmarks/binaryop/polynomials.cpp (outdated, resolved)
cpp/benchmarks/transform/polynomials.cpp (resolved)
cpp/benchmarks/transform/polynomials.cpp (resolved)

mhaseeb123 (Member) left a comment

Looks good, except for the UDF part as I am not too familiar with its semantics.

bdice (Contributor) left a comment

Nice work so far. I have a few considerations. See comments.

cpp/benchmarks/CMakeLists.txt (outdated, resolved)
cpp/benchmarks/CMakeLists.txt (outdated, resolved)
cpp/benchmarks/binaryop/polynomials.cpp (outdated, resolved)
cpp/benchmarks/transform/polynomials.cpp (outdated, resolved)

std::string expr = std::to_string(constants[0]);

for (cudf::size_type i = 0; i < order; i++) {
  expr = "( " + expr + " ) * x + " + std::to_string(constants[i + 1]);

Contributor

Do we think it's a legitimate benchmark if we hardcode these constants? I think it would be more fair to the other AST/binaryops if we provided an array of device scalar pointers that must be dereferenced for each multiplication.

A different way to think about this: do we want to JIT a new kernel for every possible set of constants and/or every possible order of polynomial? If we compute multiple polynomials, does that JIT overhead pay for itself, or do we need to assume that we will amortize the JIT overhead across multiple polynomials?
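
To make the trade-off concrete, here is a small sketch of the two ways the expression text could be generated. The first function follows the loop quoted above and bakes the constants in as literals, so every new set of constants yields different source text. The second is a hypothetical alternative in which coefficients are read from an array named `c`, so the text depends only on the polynomial order; whether the transform API can accept such an array is not established here. Integer constants are used purely to keep the generated strings short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Constants baked in as literals: the generated source changes whenever
// any constant changes, so each set of constants is a distinct kernel.
std::string horner_expr_literals(std::vector<int> const& constants)
{
  assert(!constants.empty());
  std::string expr = std::to_string(constants[0]);
  for (std::size_t i = 0; i + 1 < constants.size(); ++i) {
    expr = "( " + expr + " ) * x + " + std::to_string(constants[i + 1]);
  }
  return expr;
}

// Hypothetical alternative: constants are read from an array `c` at run time,
// so the generated source depends only on the polynomial order.
std::string horner_expr_indexed(std::size_t order)
{
  std::string expr = "c[0]";
  for (std::size_t i = 0; i < order; ++i) {
    expr = "( " + expr + " ) * x + c[" + std::to_string(i + 1) + "]";
  }
  return expr;
}
```

With constants 1, 2, 3 the first form produces `( ( 1 ) * x + 2 ) * x + 3`, while the indexed form for order 2 is `( ( c[0] ) * x + c[1] ) * x + c[2]` no matter which constants are used.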

Contributor Author

The transform API is fairly limited at present: it doesn't support multiple input columns or scalars. As we discussed with Spark-Rapids, we would need to add that support to meet their needs, but I don't think it would make much of a performance or throughput difference for this benchmark.
No, it wouldn't be okay to JIT a new kernel for each constant, but it would be reasonable to JIT a new kernel for each polynomial order.

> If we compute multiple polynomials, does that JIT overhead pay for itself, or do we need to assume that we will amortize the JIT overhead across multiple polynomials?

I don't think it would pay for itself across multiple polynomials. There is a program cache, but I don't have enough insight into its internals yet; I'll investigate and get back to you.
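
One way to frame the amortization question is a break-even count: how many evaluations are needed before a one-time compile cost is repaid by the faster kernel. The sketch below is only arithmetic. The function name is hypothetical, and no JIT compile time was measured in this thread, so any overhead value passed in is a placeholder.

```cpp
#include <cassert>
#include <cmath>

// Number of calls after which a one-time JIT compile cost is repaid by the
// per-call time saved relative to a baseline. All times are in milliseconds.
long break_even_calls(double jit_overhead_ms, double baseline_ms, double jit_kernel_ms)
{
  double const saving_per_call = baseline_ms - jit_kernel_ms;
  assert(jit_overhead_ms >= 0.0 && saving_per_call > 0.0);
  return static_cast<long>(std::ceil(jit_overhead_ms / saving_per_call));
}
```

Using the float32 GPU times at 100000000 rows and order 32 (77.237 ms for AST, 13.024 ms for transform) and a made-up compile cost of 500 ms, `break_even_calls(500.0, 77.237, 13.024)` gives 8 calls. At small row counts the per-call saving is far smaller, so the break-even count would be correspondingly larger.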

Labels
CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.), non-breaking (Non-breaking change)
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

[FEA] Add benchmarks for computing polynomials
3 participants