-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug-fix in Filter and Limit statistics #8094
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the subtlety that may be involved here I think we should have test for these changes. Given that no existing test breaks, that suggests to me that it isn't sufficiently covered 🤔
I have added tests covering the 1st and 3rd cases, but can't find a test suite for the 2nd case that I can demonstrate the need for the fix. Do you have any advice for me to do that easily? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaysynnada -- I think this is better for sure. I hope to keep improving test coverage as part of #8078
@@ -70,7 +70,11 @@ pub async fn get_statistics_with_limit( | |||
// files. This only applies when we know the number of rows. It also | |||
// currently ignores tables that have no statistics regarding the | |||
// number of rows. | |||
if num_rows.get_value().unwrap_or(&usize::MIN) <= &limit.unwrap_or(usize::MAX) { | |||
let conservative_num_rows = match num_rows { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Related to #8078.
Rationale for this change
After
enum Precision
is introduced in DF, some bugs are discovered. This PR resolves 3 bugs:ColumnStatistics
,What changes are included in this PR?
1st fix: A column is labeled as singleton only if its min and max values are exact and equal.
2nd fix: To stop processing, only exact count of rows is regarded. Otherwise, we should continue to process until range estimation of precision implemented.
3rd fix: During the analysis in filter statistics, if a column is filtered with a constant value (e.g. c=1), we set its min and max values as exact.
Are these changes tested?
Are there any user-facing changes?