Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2249: Add nan_count to handle NaNs in statistics #196
base: master
Are you sure you want to change the base?
PARQUET-2249: Add nan_count to handle NaNs in statistics #196
Changes from 2 commits
2f3449e
aecffd7
8d31bff
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect to explicitly state that
NaN value should not be written to min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData. But it is suggested to write NaN to min_values and max_values fields in the ColumnIndex where a value has to be written in case of a only-NaN page
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this with my next revision once we have decided on this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this with my next revision once we have decided on this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? We already know:
nan_counts
)null_counts
)It seems this might be enough to infer whether a page is all-NaN (except perhaps if there are repetition levels?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, if we do need an additional list (because of repeated columns?), it might be more worthwhile to add an
optional list<i64> value_counts
instead, as it would then benefit all column types.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think
optional list<i64> value_counts
is more common, but I think null already hasnull_counts
, andvalue_counts
might consume more bytes for every leaf column.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As
nan_counts
will be set only after this proposal, could we simply deduce a NaN page by checkingnull_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN
? If that is true, we can safely remove definition ofnan_pages
list.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, number of rows in the offset index isn't enough due to repeated values.
Apart from this, the suggestions seem to turn a bit in circles now. Note that all suggestions in this thread were already mentioned in my earlier post where I depicted our possible options for the column index.
@pitrou what you mentioned was my Option 2. I personally would prefer this as it feels like a useful thing to have anyway. Having said that, others pointed rightfully out that it would cost a few bytes even for non float columns. The value might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. By having these values one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be optained at all through statistics; instead one has to decode pages and count. For example, the SQL query
SELECT count(*) FROM some_nested_column;
could be fully answered with such a value_counts field.@wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you earlier actually were against writing NaNs and rather preferred the nan_pages approach:
Is your argument that if we now need to write the NaNs anyway, that we should in this case just use them instead of adding nan_pages? I do agree that this would save the extra field and I personally see nothing wrong in doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determin min=max=NaN <=> all values are NaN.
As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them and I think all of them will fulfill the requirements. In my personal opinion, I like the current approach with nan_pages actually the least, as it seems redundant if we have to write NaN values anyway and I see no problem in using NaN values for the "all values NaN check".
I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns) and I would love to add it as well and I feel the few extra bytes will be so negligible in contrast to the actual data that no-one will ever care. Also it would enable us to do the check for all values NaN the same way in page statistics and in the column index.
So we're back at the three options I proposed:
nan_pages
@wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite, yet (you just pointed out that we have to write some valid value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou writing anything but NaN into min/max was one of my suggestions to circumvent the problem that parquet-mr doesn't seem to check for NaN values in min/max while reading and therefore would probably yield wrong results once we start writing NaNs into these values.
This would only work if we go back to maintaining either
nan_pages
orvalue_counts
though, as otherwise, as you correctly pointed out, we wouldn't have a way to draw the distinction between only-NaN and real infinities.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi gabor, which implemention has do like that? I check C++ implemention but it doesn't do this. Maybe we can do a check here? Since I guess
[-inf, +inf]
sounds okThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU, I did not think about any specific implementation. (TBH, I only have experince with parquet-mr.) This is mentioned in the PR description. Maybe, we do not have any implementations as such.
@JFinis, I agree we should not care about the potential systems already writing NaN values into column indexes. Also agree that writing NaN values to min/max is risky for existing systems. So we need to write non-NaN valid values to min/max for all-NaN pages. (And of course mark them with either
nan_pages
orvalue_counts
.)The more we narrow the range the higher the chance the page will be dropped during filtering which is good because we should not search for NaN values based on the spec anyway. What do you think about
[-Inf, -Inf]
? The worst case is we will read the page of all NaN values instead of dropping. In this very case we would not drop it for< x
like cases. (This turned out to be the rephrasing and summary of your previous comments. 😄 )There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
[-inf, +inf]
it's ok. Now I guess only Rust impl and parquet-mr has the potential problem.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to catching up the discussion. I like the idea to write either [-inf, +inf] or [-0.0, +0.0] for NaN-only pages.
As NaN value does not have a well-defined order across systems, simply leveraging page min/max values to filter NaN does not make any sense. Therefore I think this design can break such misuses.