Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: simplify DataSource statistics code #8172

Closed
wants to merge 1 commit into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 14, 2023

Which issue does this PR close?

This is part of #8078

Rationale for this change

I am trying to improve the handling of statistics by changing the internal representation, so I would like less to change.

My longer term goal is to keep all the calculation and comparison code for Statistics within the Statistics class itself, which will make testing and changing implementation details

What changes are included in this PR?

  1. remove some direct manipulation of Precision and use pre-existing methods min, max and add

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Nov 14, 2023
}
}

/// If the given value is numerically lesser than the original minimum value,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here


/// If the given value is numerically greater than the original maximum value,
/// return the new maximum value with appropriate exactness information.
fn set_max_if_greater(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a similar difference here as well. Max values are conserved by relaxing the exactness when an absent statistic is read from the file.

@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
(max_values, min_values)
}

fn add_row_stats(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they don't share the same behavior, but it didn't show up due to lack of testing. The add() function operates in the safest way; if one of the operands is absent, the result will also be absent. On the other hand, the remove() function keeps the non-absent value by changing its exactness (absent + exact(value) => inexact(value)).

@alamb alamb marked this pull request as ready for review November 14, 2023 16:26
@alamb
Copy link
Contributor Author

alamb commented Nov 14, 2023

cc @berkaysynnada and @ozankabak

@alamb alamb changed the title Minor: simplify DataSource statistics Minor: simplify DataSource statistics code Nov 14, 2023
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions such as add(), min(), and max(), which I initially introduced in the precision API, were designed to operate in the safest way. However, I observe that the common usage of these functions in the code is trying to preserve statistical values across parameter updates by converting them to inexact values. What are your thoughts on modifying the functionalities of the precision methods instead? I mean these precision methods' logic could be replaced by the functions you removed.

Thanks for the simplifications

@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
(max_values, min_values)
}

fn add_row_stats(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they don't share the same behavior, but it didn't show up due to lack of testing. The add() function operates in the safest way; if one of the operands is absent, the result will also be absent. On the other hand, the remove() function keeps the non-absent value by changing its exactness (absent + exact(value) => inexact(value)).


/// If the given value is numerically greater than the original maximum value,
/// return the new maximum value with appropriate exactness information.
fn set_max_if_greater(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a similar difference here as well. Max values are conserved by relaxing the exactness when an absent statistic is read from the file.

}
}

/// If the given value is numerically lesser than the original minimum value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@ozankabak
Copy link
Contributor

@berkaysynnada, good catch! Let's consider changing the add/max/min methods to create Inexact types when they encounter an Absent. @alamb, can you think of a case where we would need to propagate Absent if any operand is Absent?

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2023

What are your thoughts on modifying the functionalities of the precision methods instead? I mean these precision methods' logic could be replaced by the functions you removed.

This is an excellent catch @berkaysynnada -- thank you. My thoughts are that I would like to initially not change any behavior unless it is clearly a bug. So in this case, I will:

  1. add tests to the datasource statistics calculations so there is coverage for this behavior
  2. move this code into Precision (as a method like estimated_add or something)

The reason to move the code into Precision is to consolidate how this structure is being used (otherwise we can end up with surprises like this)

@berkaysynnada, good catch! Let's consider changing the add/max/min methods to create Inexact types when they encounter an Absent. @alamb, can you think of a case where we would need to propagate Absent if any operand is Absent?

I can not. One test would be to try to change Precision::add and see if any tests break, though given our lack of coverage maybe this doesn't tell us as much as we would like.

@alamb alamb marked this pull request as draft November 15, 2023 16:32
@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
(max_values, min_values)
}

fn add_row_stats(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it is always adding exact values, so there should be no difference on the results whether it is used add() or estimated_add()

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How embarassing -- I even noted the lack of unit tests in this file on https://github.com/apache/arrow-datafusion/pull/7793/files#r1358805175 🤦

@alamb alamb mentioned this pull request Nov 15, 2023
8 tasks
@berkaysynnada
Copy link
Contributor

How embarassing -- I even noted the lack of unit tests in this file on https://github.com/apache/arrow-datafusion/pull/7793/files#r1358805175 🤦

Sadly, while I was working on the changes, I noticed that in many parts where we use statistics, we don't test them enough. You have done really well starting the Epic. It definitely requires quite a bit of work.

@alamb
Copy link
Contributor Author

alamb commented Nov 16, 2023

Improve statistics test coverage #8228

Yes, thank you -- I plan to improve the situation via #8228

@alamb
Copy link
Contributor Author

alamb commented Nov 16, 2023

I'll keep working on statistics under the aegis of a different issue

@alamb alamb closed this Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants