Fix the ORC decoding bug for the timestamp data #17570
Conversation
some old comment, not sure if applicable still
Just a couple small suggestions.
Should we run a benchmark on this patch to see how much performance impact it has?
Are there Spark-RAPIDS benchmarks that we can (also) run to check the impact?
Some comments. Overall looks good.
I'll be on vacation for the next week and I don't want to block this PR, so I'm just leaving comments without requesting blocking changes. Feel free to ping me if you have thoughts though!
__shared__ run_cache_manager run_cache_manager_inst;
cache_helper cache_helper_inst(run_cache_manager_inst);
Do we need to make any changes to the shared memory allocation upon launching the kernel?
The size of the shared memory needed won't change throughout the kernel execution, hence the static allocation. Does this answer your question?
Correct me if I'm wrong, but I don't understand why the shared memory size of the kernel would not change when we add a new shared memory object.
It has changed, but we don't need to declare this explicitly; the size of statically allocated shared memory is known at compile time, so CUDA takes care of it for us.
Yeah, this makes sense. Thanks.
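To illustrate the point made above: statically declared `__shared__` storage is sized by the compiler, so adding a new shared object grows the kernel's per-block footprint without any change to the launch configuration. The sketch below is illustrative only (the struct and kernel names are hypothetical, not the actual cuDF code):

```cuda
// Hypothetical 12-byte payload standing in for run_cache_manager's state.
struct run_cache_state { unsigned a, b, c; };

__global__ void decode_kernel()
{
  __shared__ int scratch[128];       // existing static allocation
  __shared__ run_cache_state cache;  // new object: the compiler accounts for
                                     // its size automatically at compile time
  // ... kernel body would use scratch and cache ...
}

// Launching: no byte count is passed for static allocations. The optional
// third launch parameter only sizes dynamic (`extern __shared__`) memory:
//   decode_kernel<<<grid, block>>>();          // static shared memory only
//   decode_kernel<<<grid, block, extra>>>();   // extra bytes for dynamic smem
```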
run_cache_manager increases the shared memory usage per block by 12 bytes, which is negligible compared to the 37,776 bytes already used by orcdec_state_s.
@ttnghia Yeah, that's a lot of shared memory to use. The static arrays in orcdec_state_s, such as the intermediate "byte streams" and the intermediate decoded output, are the biggest contributors. This is required by the current design, where each block has a hardcoded 1,024 threads so it can consume two 512-length runs at a time. We may improve this part in the future when needed.
/merge
Merged 4e97cd4 into rapidsai:branch-25.02
LGTM!
Description
This PR introduces a band-aid class, run_cache_manager, to handle an exceptional case in the TIMESTAMP data type, where the DATA stream (seconds) is processed ahead of the SECONDARY stream (nanoseconds) and the excess rows are lost. The fix uses run_cache_manager (along with cache_helper, which is an implementation detail) to cache the potentially missed data from the DATA stream and make it available to the next decoding iteration, thus preventing data loss.
Closes #17155
Checklist