-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark column chunks in a PQ reader pass
as large strings when the cumulative offsets
exceeds the large strings threshold.
#17207
Conversation
|
||
int constexpr multiplier = 12; | ||
std::vector<cudf::column_view> input_cols(multiplier, input->view()); | ||
auto col0 = cudf::concatenate(input_cols); ///< 2.70GB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same column from GTest: CaseTest.ToLower
pass
as large strings when the cumulative offsets
exceeds the large strings threshold.
pass
as large strings when the cumulative offsets
exceeds the large strings threshold.pass
as large strings when the cumulative offsets
exceeds the large strings threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments; the core of the change looks good.
Co-authored-by: Vukasin Milovanovic <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for iterating on this!
/merge |
…n the cumulative `offsets` exceeds the large strings threshold. (rapidsai#17207)" This reverts commit e52ce85.
hi @mhaseeb123 , I have a silly question (because I'm not very familiar with CUDF code): why are cumulative string offsets counted per pass instead of per subpass? We have a parquet file whose string column in one RG alone is more than 2G, after #17207 such column will be treated as large string column and thus become unacceptable in Spark-Rapids (Spark-Rapids does not allow large columns, as described in #16215, and AFAIK this issue is not easy to fix) If cumulative string offsets are counted per subpass, then we can keep using normal string columns in many cases. Is there any particular reason stopping us from going with subpass? |
@binmahone @mhaseeb123 |
Yes, you are correct. I updated the behavior accordingly in #17693 and it solves both @binmahone's comment and the original issue this PR aimed to solve. |
…ze exceeds threshold (#17693) This PR improves on #17207 and only uses 64-bit offsets if the current output chunk of a strings column exceeds the large-strings threshold instead of using cumulative strings column sizes per `pass` or `row group` level. Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Karthikeyan (https://github.com/karthikeyann) - David Wendt (https://github.com/davidwendt) - Yunsong Wang (https://github.com/PointKernel) URL: #17693
Description
This PR implements a method to correctly set the large-string property for column chunks in a in the Chunked Parquet Reader subpass if the cumulative string offsets have exceeded the large strings threshold.
Fixes #17158
Checklist