DRAFT: [C++][Parquet] Use num_nulls from DataPageV2 to skip null handling #43955
base: main
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
@ursabot please benchmark |
Benchmark runs are scheduled for commit c13fe56. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
@mapleFU I came up with this simple optimization. I'm not sure it will make a difference in practice... |
@@ -822,6 +822,7 @@ class ColumnReaderImplBase {
      max_size -= def_levels_bytes;
    }

    current_page_may_have_nulls_ = max_def_level_ > 0 || max_rep_level_ > 0;
Maybe a better name or comment for `nulls_`? Since `max_rep_level_ > 0` and null is weird
I didn't go through this carefully, but I think this is generally OK. For page-v1, is it also OK when num_nulls exists in the stats and is equal to 0? (Besides, personally we're using the column-index to predict, and in the Parquet community page-v2 is not enabled by default 🤔 Maybe I should pick up the filtering patch ( #39731 ) after my vacation this month... )
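To make the question above concrete, here is a minimal sketch of how a reader could decide whether a page may contain nulls. It is not the code in this PR: `PageInfo`, its fields, and `PageMayHaveNulls` are hypothetical names, and the page-v1 branch only illustrates the suggestion of trusting a null count of 0 in the page statistics when it is present.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified view of what a reader knows about one data page.
struct PageInfo {
  bool is_v2 = false;                       // DataPageV2 carries num_nulls in its header
  int32_t v2_num_nulls = 0;                 // only meaningful when is_v2 is true
  std::optional<int64_t> stats_null_count;  // optional null count from page statistics
};

// Returns true when the reader must still be prepared for nulls in this page.
inline bool PageMayHaveNulls(const PageInfo& page, int16_t max_def_level) {
  if (max_def_level == 0) return false;  // required, non-nested column
  if (page.is_v2) return page.v2_num_nulls > 0;
  if (page.stats_null_count.has_value()) return *page.stats_null_count > 0;
  return true;  // no information available: stay conservative
}
```

The fallback returns true so that a page without any null information keeps using the existing null-aware path.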
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit c13fe56. There were 16 benchmark results indicating a performance regression:
The full Conbench report has more details. |
@@ -2088,6 +2110,10 @@ class FLBARecordReader final : public TypedRecordReader<FLBAType>,
  }

  void ReadValuesSpaced(int64_t values_to_read, int64_t null_count) override {
    if (null_count == 0) {
      ReadValuesDense(values_to_read);
I've checked `FLBARecordReader`; its `ReadDense` uses an extra `null_bitmap_builder_`, sigh
After reading this part of the code, this idea LGTM.
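For readers following the thread, the shortcut in the hunk above can be sketched in isolation. This is a simplified illustration, not the Arrow implementation: `ReadDense` and `ReadSpaced` here are hypothetical free functions over `std::vector`, and the validity bitmap bookkeeping mentioned in the review is left out.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decoder: copies the first `n` decoded values to `out`.
inline void ReadDense(const std::vector<int32_t>& decoded, int64_t n,
                      std::vector<int32_t>* out) {
  out->assign(decoded.begin(), decoded.begin() + n);
}

// Spaced read: `valid[i]` says whether slot i holds a value; null slots are set to 0.
// `decoded` must hold at least as many values as there are valid slots.
inline void ReadSpaced(const std::vector<int32_t>& decoded,
                       const std::vector<bool>& valid, int64_t null_count,
                       std::vector<int32_t>* out) {
  const int64_t slots = static_cast<int64_t>(valid.size());
  if (null_count == 0) {
    // Every slot holds a value, so the per-slot validity check can be skipped.
    ReadDense(decoded, slots, out);
    return;
  }
  out->assign(slots, 0);
  int64_t next = 0;  // index of the next unread decoded value
  for (int64_t i = 0; i < slots; ++i) {
    if (valid[i]) (*out)[i] = decoded[next++];
  }
}
```

Both paths produce the same output when `null_count` is 0; the dense path simply avoids walking the validity information.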
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?