Mukernels strings #17286
Conversation
One last comment. Are there any additional benchmarks worth adding here? Something related to Ed's long-strings examples?
I've moved the strings-specific decode functions into page_string_utils.cuh
I'm slowly getting back up to speed after the break (and am fighting security own-goals on my workstation), but I've managed to test this with some of my existing files. For the most part it seems great and has decreased decode times 👍. But for one file I'm having issues that need to be tracked down. @nvdbaranec do you still have the "curand" file I sent you some time ago? That's the horribly nested one with rows that span pages. I'm finding that I'm not able to read that one correctly (might be user error, so I need to pin this down better on my end). If you don't have it any more, I can send it again via Slack so you all can test it out.
The main thing missing from the current benchmarks is a test for long strings; by default it's generating strings that are all < 32 chars in length. I can add a test that increases the limit up to 128 or something.
This is now added. Times are all the same or faster.
Still running this to ground... I've found that the string offsets for the pages that I believe contain part of a single row are all 0. I'm not sure if the string data is copied for those pages yet or not.
looks great, just one possible suggestion
/merge
Merged commit 834565a into rapidsai:branch-25.02.
This reverts commit 834565a.
Moves Parquet string decoding from its stand-alone kernel to the templated generic kernel. To optimize performance, the scheme for copying values to the output has changed. The details of this scheme are in gpuDecodeString(), but briefly:
The block size is 128 threads. The threads in the block share the copying work such that each thread copies up to 4 bytes per memcpy (this showed the best performance). So, for a given batch of strings, the longer the average string length, the more threads work together to copy each string. We cap this at 32 threads per string (a whole warp) for strings longer than 64 bytes (at length 65, 16 threads would each have to copy 5 bytes).
For short strings we use a minimum of 4 threads per string, which results in at most 32 simultaneous string copies. We don't go beyond 32 simultaneous copies because performance decreases. This is presumably because the cache line size is 128 bytes, and with so many threads copying across the blocks we run out of room in the cache.
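To make the scheme concrete, here is a minimal device-side sketch of the thread-assignment idea, assuming a 128-thread block; the function names, the power-of-two rounding, and the byte-wise copy loop are illustrative and are not the actual gpuDecodeString() code:

```cpp
// Illustrative sketch only: the real gpuDecodeString() differs in details.
// Pick how many threads cooperate on one string so that a single pass copies
// roughly 4 bytes per thread, clamped to the 4..32 range described above.
__device__ int threads_per_string(int avg_string_len)
{
  int t = (avg_string_len + 3) / 4;   // target ~4 bytes per thread per memcpy
  t     = max(4, min(32, t));         // at least 4 threads, at most one warp
  int p = 4;                          // round up to a power of two so strings
  while (p < t) { p *= 2; }           // tile evenly onto the 128-thread block
  return p;
}

// Cooperative copy: `lane` in [0, num_threads) moves its 4-byte slices of one string.
__device__ void copy_string(char* dst, char const* src, int len, int lane, int num_threads)
{
  for (int pos = lane * 4; pos < len; pos += num_threads * 4) {
    int const nbytes = min(4, len - pos);
    memcpy(dst + pos, src + pos, nbytes);  // each thread copies up to 4 bytes per pass
  }
}
```

With 4 threads per string the block copies 32 strings at once (the cap described above); with the full 32-thread-per-string case it copies 4 long strings at once.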
Benchmark Results (Gaussian-distributed string lengths):
These performance improvements also hold for this previous long-string performance issue. The primary source of the improvement is that all 128 threads in the block now help copy the strings, whereas before we were only using one warp for the copy (due to the caching issues). The performance of the non-dictionary and zero-cardinality cases is limited because we are bound by the time needed to copy the string data from global memory. For cardinality-1000 dictionary data, the requested strings are often still in the cache, so the full benefit of the better thread utilization can be realized.
Checklist