
Mukernels strings #17286

Merged: 87 commits into rapidsai:branch-25.02 on Jan 15, 2025

Conversation

@pmattione-nvidia (Contributor) commented Nov 8, 2024

Moves Parquet string decoding from its stand-alone kernel to the templated generic kernel. To optimize performance, the scheme for copying values to the output has changed. The details are in gpuDecodeString(), but briefly:

The block size is 128 threads. The threads in the block share the copying work such that each thread copies up to 4 bytes per memcpy (this showed the best performance). So, for a given batch of strings, the longer the average string length, the more threads work together to copy each one. We cap this at 32 threads per string (a whole warp) for strings longer than 64 bytes (at length 65, 16 threads would each have to copy 5 chars).

For short strings we use a minimum of 4 threads per string, which results in at most 32 simultaneous string copies per block. Going beyond 32 simultaneous copies decreases performance, presumably because the cache line size is 128 bytes and, with that many copies in flight across the blocks, we run out of room in the cache. A sketch of this cooperative copy scheme follows.
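
To illustrate, here is a minimal sketch of the cooperative copy scheme (this is not the actual gpuDecodeString() implementation; the function signature and variable names are illustrative assumptions):

```cuda
// Minimal sketch of the cooperative copy scheme described above. This is NOT
// the real gpuDecodeString() code; the signature and names are illustrative.
__device__ void copy_strings_cooperatively(char* out_chars,
                                           int const* out_offsets,
                                           char const* const* src_ptrs,
                                           int const* src_lens,
                                           int num_strings,
                                           int avg_len)
{
  constexpr int block_size       = 128;  // decode block size
  constexpr int bytes_per_thread = 4;    // each thread copies up to 4 bytes per memcpy

  // Pick how many threads cooperate on one string: enough that each thread
  // copies roughly 4 bytes of the average string, clamped to [4, 32] (one warp).
  int threads_per_string = 4;
  while (threads_per_string < 32 && threads_per_string * bytes_per_thread < avg_len) {
    threads_per_string *= 2;
  }

  int const strings_in_flight = block_size / threads_per_string;  // at most 32
  int const lane              = threadIdx.x % threads_per_string;
  int const first_string      = threadIdx.x / threads_per_string;

  for (int s = first_string; s < num_strings; s += strings_in_flight) {
    char const* src = src_ptrs[s];
    char* dst       = out_chars + out_offsets[s];
    int const len   = src_lens[s];
    // Each cooperating thread copies a strided range of (up to) 4-byte chunks.
    for (int pos = lane * bytes_per_thread; pos < len;
         pos += threads_per_string * bytes_per_thread) {
      memcpy(dst + pos, src + pos, min(bytes_per_thread, len - pos));
    }
  }
}
```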

Benchmark Results (Gaussian-distributed string lengths):

  • NO dictionary, length from 0 - 32: No difference
  • NO dictionary, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 10% - 20% faster.
  • Dictionary, cardinality 0: 0% - 15% faster.
  • Dictionary, cardinality 1000, length from 0 - 32: 30% - 35% faster.
  • Dictionary, cardinality 1000, larger lengths (32 - 64, 16 - 80, 64 - 128, etc.): 50% - 60% faster.
  • Selected customer data: 5% faster.

These performance improvements also hold for this previous long-string performance issue. The primary source of the improvement is that all 128 threads in the block now help copy the strings, whereas before only one warp did the copy (due to the caching issues). The gains for the non-dictionary and zero-cardinality cases are limited because they are bound by the time needed to copy the string data from global memory. For cardinality-1000 dictionary data, the requested strings are often still in the cache, so the full benefit of the better thread utilization is realized.
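
For reference, here is a hypothetical sketch of how Gaussian-distributed string lengths like those used in the benchmarks above might be generated; this is not the libcudf benchmark code, and the helper name is invented:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper mirroring the benchmark data described above: strings
// whose lengths follow a Gaussian distribution, clamped at zero.
std::vector<std::string> make_gaussian_strings(std::size_t n, double mean_len, double stddev)
{
  std::mt19937 gen{42};
  std::normal_distribution<double> len_dist{mean_len, stddev};
  std::uniform_int_distribution<int> char_dist{'a', 'z'};

  std::vector<std::string> strings(n);
  for (auto& s : strings) {
    auto const len = std::max(0L, std::lround(len_dist(gen)));  // clamp negative draws to 0
    s.resize(static_cast<std::size_t>(len));
    for (auto& c : s) { c = static_cast<char>(char_dist(gen)); }
  }
  return strings;
}
```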

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@nvdbaranec (Contributor) left a comment

One last comment. Are there any additional benchmarks worth adding here? Something related to Ed's long-strings examples?

@pmattione-nvidia (Contributor, Author)

I think one thing to do here right off the bat is to move the string stuff out of decode_fixed.cu and into something new like decode_strings.cu.

Outside the scope of this PR, I think we should spend some time cleaning up this whole folder. Things have gotten haphazard with respect to where the various decoders and support functions live. @vuule, @etseidl we should set up a little discussion after the new year to see if we can make things a little more uniform.

I've moved the strings-specific decode functions into page_string_utils.cuh

@etseidl (Contributor) commented Jan 7, 2025

I'm slowly getting back up to speed after the break (and am fighting security own-goals on my workstation), but I've managed to test this with some of my existing files. For the most part it seems great and has decreased decode times 👍.

But for one file I'm having issues that need to be tracked down. @nvdbaranec do you still have the "curand" file I sent you some time ago? That's the horribly nested one with rows that span pages. I'm finding that I'm not able to read that one correctly (might be user error, so need to pin this down better on my end). If you don't have it any more I can send it again via slack so you all can test it out.

@pmattione-nvidia (Contributor, Author)

> One last comment. Are there any additional benchmarks worth adding here? Something related to Ed's long-strings examples?

The main thing missing from the current benchmarks is a test for long strings; by default it's generating strings that are all < 32 chars in length. I can add a test that increases the limit up to 128 or something.

@pmattione-nvidia (Contributor, Author) commented Jan 8, 2025

> One last comment. Are there any additional benchmarks worth adding here? Something related to Ed's long-strings examples?
>
> The main thing missing from the current benchmarks is a test for long strings; by default it's generating strings that are all < 32 chars in length. I can add a test that increases the limit up to 128 or something.

This is now added. Times are all the same or faster.

@etseidl (Contributor) commented Jan 8, 2025

Still running this to ground...I've found that the string offsets for the pages that I believe contain part of a single row are all 0. I'm not sure if the string data is copied for those pages yet or not.

@vuule (Contributor) left a comment

looks great, just one possible suggestion

@pmattione-nvidia (Contributor, Author)

/merge

@rapids-bot (bot) merged commit 834565a into rapidsai:branch-25.02 on Jan 15, 2025
105 of 106 checks passed
Labels

  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change
  • Performance: Performance related issue