-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance improvements and simplifications for fixed size row-based rolling windows #17623
Performance improvements and simplifications for fixed size row-based rolling windows #17623
Conversation
Sits on top of #17613, so don't merge until that is in and this is rebased/has trunk back-merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signposts
namespace utils = cudf::detail::rolling; | ||
auto groups = utils::grouped{group_labels.data(), group_offsets.data()}; | ||
auto preceding = | ||
utils::make_clamped_window_iterator<utils::direction::PRECEDING>(preceding_window, groups); | ||
auto following = | ||
utils::make_clamped_window_iterator<utils::direction::FOLLOWING>(following_window, groups); | ||
return cudf::detail::rolling_window( | ||
input, default_outputs, preceding, following, min_periods, aggr, stream, mr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the new utilities, so we can delete a bunch of code.
@@ -660,7 +661,7 @@ TEST_F(RollingErrorTest, WindowArraySizeMismatch) | |||
cudf::test::fixed_width_column_wrapper<cudf::size_type> input( | |||
col_data.begin(), col_data.end(), col_valid.begin()); | |||
|
|||
std::vector<cudf::size_type> five({2, 1, 2, 1, 4}); | |||
std::vector<cudf::size_type> five({1, 1, 2, 1, 0}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs adapted to conform to new caller requirement.
// this is a special test to check the volatile count variable issue (see rolling.cu for detail) | ||
TYPED_TEST(RollingTest, VolatileCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All those volatiles are long-gone (cuda 10.1 bug)
auto it = thrust::make_counting_iterator<cudf::size_type>(0); | ||
std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto p = window_rng.generate(); | ||
return std::min(i + 1, std::max(p, i + 1 - num_rows)); | ||
}); | ||
std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto f = window_rng.generate(); | ||
return std::max(-i - 1, std::min(f, num_rows - i - 1)); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensuring the randomly generated values conform to interface requirements
auto it = thrust::make_counting_iterator<cudf::size_type>(0); | ||
std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto p = window_rng.generate(); | ||
return std::min(i + 1, std::max(p, i + 1 - num_rows)); | ||
}); | ||
std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) { | ||
auto f = window_rng.generate(); | ||
return std::max(-i - 1, std::min(f, num_rows - i - 1)); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And again.
Note that I haven't touched the jit rolling window kernel yet (some of the same changes will apply there eventually). @mythrocks This changes the API contract for to rolling_window, so spark may need to adapt. |
I'm trying to work out what of this has impact on |
I think this is a call into cudf/cpp/src/rolling/grouped_rolling.cu Lines 47 to 55 in 030ae7a
If so, those fixed offsets are still clamped to the group bounds, via a transform iterator: cudf/cpp/src/rolling/grouped_rolling.cu Lines 127 to 133 in 030ae7a
|
That's what I thought. It seems to me like this really ought to work for I'll try have a go at integrating this with |
I'd like to double-check my work, but it appears that the window-function tests in I'll have another go at the tests to confirm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good to me. I'm going to play with this some more, and try finalize the review tomorrow.
struct fixed_window_clamper { | ||
Grouping groups; | ||
cudf::size_type delta; | ||
[[nodiscard]] __device__ cudf::size_type operator()(cudf::size_type i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suppose we might get away with making this a const
function?
[[nodiscard]] __device__ cudf::size_type operator()(cudf::size_type i) | |
[[nodiscard]] __device__ cudf::size_type operator()(cudf::size_type i) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, thanks. I guess it can be noexcept
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it can't be noexcept because min/max are not noexcept
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I've verified that this change does not intrude on the spark-rapids
functionality. The PR keeps the API and behaviour intact, at least from spark-rapids
's perspective.
030ae7a
to
7e9922b
Compare
This way, the responsibility of a producing an iterator that is in bounds falls on the caller, and we don't have to do extra work in cases where the input is guaranteed to be in bounds already.
When passing column_view objects for preceding and following windows, the caller is responsible for ensuring that any resulting indexing is in-bounds.
The caller must provide in-bounds columns for the preceding and following windows.
Now that the requirement for producing in-bounds data has moved to the caller, we must adapt the tests to produce the correct inputs.
2df0426
to
3b3b15f
Compare
// The caller is required to provide window bounds that will | ||
// result in indexing that is in-bounds for the column. Therefore all | ||
// of these calculations cannot overflow and we can do everything | ||
// in size_type arithmetic. Moreover, we require that start <= | ||
// end, i.e., the window is never "reversed" though it may be empty. | ||
auto const preceding_window = preceding_window_begin[i]; | ||
auto const following_window = following_window_begin[i]; | ||
|
||
// compute bounds | ||
auto const start = static_cast<size_type>( | ||
min(static_cast<int64_t>(input.size()), max(int64_t{0}, i - preceding_window + 1))); | ||
auto const end = static_cast<size_type>( | ||
min(static_cast<int64_t>(input.size()), max(int64_t{0}, i + following_window + 1))); | ||
auto const start_index = min(start, end); | ||
auto const end_index = max(start, end); | ||
size_type const start = i - preceding_window + 1; | ||
size_type const end = i + following_window + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check my working for integer overflow here. preceding_window
and following_window
both have type size_type
(as does i
). So this is all happening in size_type
arithmetic.
Since i
is the row index, it is contained in [0, ..., num_rows)
. So i + 1
is always safe. So we just have to make sure that subtracting a negative preceding window never overflows, and adding a positive following window never overflows. This is handled by the clamping kernel.
So if we inline that logic, this looks like:
size_type start = i - cuda::std::min(i + 1, cuda::std::max(delta, i + 1 - num_rows))) + 1;
size_type end = i + cuda::std::max(- i - 1, cuda::std::min(delta, num_rows - i - 1)) + 1;
I think none of these operations overflow if num_rows == INT_MAX
but please check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too familiar with the design of rolling windows in libcudf per se, but the changes mostly make sense to me. Would recommend feedback from another more familiar reviewer if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall. A couple of questions -
/merge |
Description
Having introduced benchmarks of rolling window performance in #17613, we can now look at what happens when cleaning up some the implementation.
Breaking change
Previously, the most general "variable-sized" rolling window function applied a transform iterator (sometimes multiple times) to the user-provided
preceding
andfollowing
columns. We now require, and document, that the user provides values that are always in bounds.This simplifies the book-keeping, and allows us to remove some (now redundant) checks from the rolling kernel.
Performance improvements
In various places, we accept fixed size integers as row-based offsets for rolling windows. We used to materialise these into columns representing the clamped
preceding
andfollowing
columns. However, we did so inconsistently with respect to which side we would clamp (after the change to allow negative window offsets).To fix this, rationalise this clamping in one place, and just use a transform iterator. We now do this consistently for both grouped and ungrouped fixed-size windows.
Benchmark results
Comparing da4accc (the tip of #17613) and this branch:
nvbench compare output
Headline numbers:
All the data
latest/_deps/nvbench-src/scripts/nvbench_compare.py da4accc.json c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json
['da4acccf6b76a93b1e9395b51950a17c45da99c3.json', 'c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json']
row_grouped_rolling_sum
[0] NVIDIA RTX A6000
row_fixed_rolling_sum
[0] NVIDIA RTX A6000
row_variable_rolling_sum
[0] NVIDIA RTX A6000
Summary
Before
Benchmark Results
row_grouped_rolling_sum
[0] NVIDIA RTX A6000
row_fixed_rolling_sum
[0] NVIDIA RTX A6000
row_variable_rolling_sum
[0] NVIDIA RTX A6000
After
Benchmark Results
row_grouped_rolling_sum
[0] NVIDIA RTX A6000
row_fixed_rolling_sum
[0] NVIDIA RTX A6000
row_variable_rolling_sum
[0] NVIDIA RTX A6000
Checklist