Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2249: Add nan_count to handle NaNs in statistics #196

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 1 addition & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -161,21 +161,7 @@ following rules:
* FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
signed zeros. The details are documented in the
[Thrift definition](src/main/thrift/parquet.thrift) in the
`ColumnOrder` union. They are summarized here but the Thrift definition
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
is considered authoritative:
* NaNs should not be written to min or max statistics fields.
* If the computed max value is zero (whether negative or positive),
`+0.0` should be written into the max statistics field.
* If the computed min value is zero (whether negative or positive),
`-0.0` should be written into the min statistics field.

For backwards compatibility when reading files:
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When looking for NaN values, min and max should be ignored.

`ColumnOrder` union.
* BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
comparison.

Expand Down
46 changes: 41 additions & 5 deletions src/main/thrift/parquet.thrift
Original file line number Diff line number Diff line change
Expand Up @@ -211,7 +211,7 @@ struct Statistics {
*/
1: optional binary max;
2: optional binary min;
/** count of null value in the column */
/** count of null values in the column */
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
Expand All @@ -223,6 +223,11 @@ struct Statistics {
*/
5: optional binary max_value;
6: optional binary min_value;
/**
* count of NaN values in the column; only present if physical type is FLOAT
* or DOUBLE
*/
7: optional i64 nan_count;
}

/** Empty structs to use as logical type annotations */
Expand Down Expand Up @@ -886,16 +891,25 @@ union ColumnOrder {
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
* (*) Because the sorting order is not specified properly for floating
* point values (relations vs. total ordering) the following
* compatibility rules should be applied when reading statistics:
* point values (relations vs. total ordering), the following compatibility
* rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
* - If the nan_count field is set, a reader can compute
* nan_count + null_count == num_values to deduce whether all non-NULL
* values are NaN.
* - When looking for NaN values, min and max should be ignored.
* If the nan_count field is set, it can be used to check whether
* NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
* - NaNs should not be written to min or max statistics fields.
* - It is suggested to always set the nan_count fields for FLOAT and
DOUBLE columns.
* - NaNs should not be written to min or max statistics fields except
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect to explicitly state that NaN value should not be written to min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData. But it is suggested to write NaN to min_values and max_values fields in the ColumnIndex where a value has to be written in case of a only-NaN page.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this with my next revision once we have decided on this issue.

* in the column index, where a value has to be written incase of
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* in the column index, where a value has to be written incase of
* in the column index, where a value has to be written in case of

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this with my next revision once we have decided on this issue.

* only-NaN pages.
* - If the computed max value is zero (whether negative or positive),
* `+0.0` should be written into the max statistics field.
* - If the computed min value is zero (whether negative or positive),
Expand Down Expand Up @@ -952,6 +966,11 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated before
* using them by inspecting null_pages.
* For columns of type FLOAT and DOUBLE, NaN values are not to be included
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
* in these bounds unless all non-null values in a page are NaN, in which
* case min and max should be set to NaN. Readers should always ignore NaN
* values in the bounds; they should check nan_pages to detect the "all
* non-null values are NaN" case.
*/
2: required list<binary> min_values
3: required list<binary> max_values
Expand All @@ -966,6 +985,23 @@ struct ColumnIndex {

/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts

/**
* A list of Boolean values to determine pages that contain only NaNs. Only
* present for columns of type FLOAT and DOUBLE. If true, all non-null
* values in a page are NaN. Writers are suggested to set the corresponding
* entries in min_values and max_values to NaN, so that all lists have the same
* length and contain valid values. If false, then either all values in the
* page are null or there is at least one non-null non-NaN value in the page.
* As readers are supposed to ignore all NaN values in bounds, legacy readers
* who do not consider nan_pages yet are still able to use the column index
* but are not able to skip only-NaN pages.
*/
6: optional list<bool> nan_pages
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? We already know:

  • the NaN count for each page (in nan_counts)
  • the null count for each page (in null_counts)
  • the number of rows for each page (from the OffsetIndex)

It seems this might be enough to infer whether a page is all-NaN (except perhaps if there are repetition levels?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, if we do need an additional list (because of repeated columns?), it might be more worthwhile to add an optional list<i64> value_counts instead, as it would then benefit all column types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think optional list<i64> value_counts is more common, but I think null already has null_counts, and value_counts might consume more bytes for every leaf column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As nan_counts will be set only after this proposal, could we simply deduce a NaN page by checking null_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN? If that is true, we can safely remove definition of nan_pages list.

Copy link
Contributor Author

@JFinis JFinis Jun 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, number of rows in the offset index isn't enough due to repeated values.

Apart from this, the suggestions seem to turn a bit in circles now. Note that all suggestions in this thread were already mentioned in my earlier post where I depicted our possible options for the column index.

@pitrou what you mentioned was my Option 2. I personally would prefer this as it feels like a useful thing to have anyway. Having said that, others pointed rightfully out that it would cost a few bytes even for non float columns. The value might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. By having these values one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be optained at all through statistics; instead one has to decode pages and count. For example, the SQL query SELECT count(*) FROM some_nested_column; could be fully answered with such a value_counts field.

@wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you earlier actually were against writing NaNs and rather preferred the nan_pages approach:

Personally speaking, apart from adding a nan_count to the statistics, I would go with the option 3: adding a nan_pages bool list to the column index. I am not in favor of writing any NaN to min/max bounds.

Is your argument that if we now need to write the NaNs anyway, that we should in this case just use them instead of adding nan_pages? I do agree that this would save the extra field and I personally see nothing wrong in doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determin min=max=NaN <=> all values are NaN.

As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them and I think all of them will fulfill the requirements. In my personal opinion, I like the current approach with nan_pages actually the least, as it seems redundant if we have to write NaN values anyway and I see no problem in using NaN values for the "all values NaN check".

I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns) and I would love to add it as well and I feel the few extra bytes will be so negligible in contrast to the actual data that no-one will ever care. Also it would enable us to do the check for all values NaN the same way in page statistics and in the column index.

So we're back at the three options I proposed:

  1. Drop nan_pages and use my initial approach of "min=max=NaN && nan_counts > 0 <=> all values are NaN" in the column index
  2. Drop nan_pages and instead add value_counts so we can use value_counts-null_counts==nan_counts to determine whether all values are null. (My personal favorite)
  3. Retain the current state and use nan_pages

@wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite, yet (you just pointed out that we have to write some valid value).

Copy link
Contributor Author

@JFinis JFinis Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou writing anything but NaN into min/max was one of my suggestions to circumvent the problem that parquet-mr doesn't seem to check for NaN values in min/max while reading and therefore would probably yield wrong results once we start writing NaNs into these values.

This would only work if we go back to maintaining either nan_pages or value_counts though, as otherwise, as you correctly pointed out, we wouldn't have a way to draw the distinction between only-NaN and real infinities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was an argument that some writers already write NaN values into column indexes. Hence, they try to filter on NaN values. Now, we start writing [-Inf,+Inf] for NaN only pages. NaN is probably out of [-Inf,+Inf] interval so that reader would drop the only NaN page while searching for a NaN.

Hi gabor, which implemention has do like that? I check C++ implemention but it doesn't do this. Maybe we can do a check here? Since I guess [-inf, +inf] sounds ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapleFU, I did not think about any specific implementation. (TBH, I only have experince with parquet-mr.) This is mentioned in the PR description. Maybe, we do not have any implementations as such.

@JFinis, I agree we should not care about the potential systems already writing NaN values into column indexes. Also agree that writing NaN values to min/max is risky for existing systems. So we need to write non-NaN valid values to min/max for all-NaN pages. (And of course mark them with either nan_pages or value_counts.)

The more we narrow the range the higher the chance the page will be dropped during filtering which is good because we should not search for NaN values based on the spec anyway. What do you think about [-Inf, -Inf]? The worst case is we will read the page of all NaN values instead of dropping. In this very case we would not drop it for < x like cases. (This turned out to be the rephrasing and summary of your previous comments. 😄 )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think [-inf, +inf] it's ok. Now I guess only Rust impl and parquet-mr has the potential problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to catching up the discussion. I like the idea to write either [-inf, +inf] or [-0.0, +0.0] for NaN-only pages.

As NaN value does not have a well-defined order across systems, simply leveraging page min/max values to filter NaN does not make any sense. Therefore I think this design can break such misuses.


/** A list containing the number of NaN values for each page. Only present
* for columns of type FLOAT and DOUBLE. **/
7: optional list<i64> nan_counts
}

struct AesGcmV1 {
Expand Down