Mukernels strings #17286
Conversation
One last comment. Are there any additional benchmarks worth adding here? Something related to Ed's long-strings examples?
I've moved the strings-specific decode functions into page_string_utils.cuh
I'm slowly getting back up to speed after the break (and am fighting security own-goals on my workstation), but I've managed to test this with some of my existing files. For the most part it seems great and has decreased decode times 👍. But for one file I'm having issues that need to be tracked down. @nvdbaranec do you still have the "curand" file I sent you some time ago? That's the horribly nested one with rows that span pages. I'm finding that I'm not able to read that one correctly (might be user error, so I need to pin this down better on my end). If you don't have it any more, I can send it again via Slack so you all can test it out.
The main thing missing from the current benchmarks is a test for long strings; by default it's generating strings that are all < 32 chars in length. I can add a test that increases the limit up to 128 or something.
This is now added. Times are all the same or faster.
Still running this to ground... I've found that the string offsets for the pages that I believe contain part of a single row are all 0. I'm not sure if the string data is copied for those pages yet or not.
looks great, just one possible suggestion
/merge
Merged commit 834565a into rapidsai:branch-25.02.
This reverts commit 834565a.
Moves Parquet string decoding from its stand-alone kernel to the templated generic kernel. To optimize performance, the scheme for copying values to the output has changed. The details of this scheme are in gpuDecodeString(), but briefly:
The block size is 128 threads. The threads in the block share the copying work such that each thread copies up to 4 bytes per memcpy (this showed the best performance). So, for a given batch of strings, the longer the average string length, the more threads work together to copy each string. We cap this at 32 threads per string (a whole warp) for strings longer than 64 bytes (at length 65, 16 threads would each have to copy 5 bytes).
For short strings we use a minimum of 4 threads per string, which results in at most 32 simultaneous string copies. We don't go beyond 32 simultaneous copies because performance decreases. This is presumably because the cache line size is 128 bytes, and with so many threads copying across the blocks we run out of room in the cache.
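To make the scheme concrete, here is a minimal device-side sketch of the thread-assignment idea, assuming a 128-thread block; the function names, the power-of-two rounding, and the byte-wise copy loop are illustrative and are not the actual gpuDecodeString() code:

```cpp
// Illustrative sketch only: the real gpuDecodeString() differs in details.
// Pick how many threads cooperate on one string so that a single pass copies
// roughly 4 bytes per thread, clamped to the 4..32 range described above.
__device__ int threads_per_string(int avg_string_len)
{
  int t = (avg_string_len + 3) / 4;   // target ~4 bytes per thread per memcpy
  t     = max(4, min(32, t));         // at least 4 threads, at most one warp
  int p = 4;                          // round up to a power of two so strings
  while (p < t) { p *= 2; }           // tile evenly onto the 128-thread block
  return p;
}

// Cooperative copy: `lane` in [0, num_threads) moves its 4-byte slices of one string.
__device__ void copy_string(char* dst, char const* src, int len, int lane, int num_threads)
{
  for (int pos = lane * 4; pos < len; pos += num_threads * 4) {
    int const nbytes = min(4, len - pos);
    memcpy(dst + pos, src + pos, nbytes);  // each thread copies up to 4 bytes per pass
  }
}
```

With 4 threads per string the block copies 32 strings at once (the cap described above); with the full 32-thread-per-string case it copies 4 long strings at once.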
Benchmark Results (Gaussian-distributed string lengths):
These performance improvements also hold for this previous long-string performance issue. The primary source of the improvement is that all 128 threads in the block now help copy the strings, whereas before we were only using one warp for the copy (due to the caching issues). The performance of the non-dictionary and zero-cardinality cases is limited because we are bound by the time needed to copy the string data from global memory. For cardinality-1000 dictionary data, the requested strings are often still in the cache, so the full benefit of the better thread utilization can be realized.
Checklist