Safeguard against potential inexact row count being smaller than exact null count #9007

Merged 2 commits on Jan 27, 2024
7 changes: 6 additions & 1 deletion datafusion/physical-plan/src/joins/utils.rs
@@ -955,7 +955,12 @@ fn max_distinct_count(
 let result = match num_rows {
     Precision::Absent => Precision::Absent,
     Precision::Inexact(count) => {
-        Precision::Inexact(count - stats.null_count.get_value().unwrap_or(&0))
+        // To safeguard against inexact number of rows (e.g. 0) being smaller than
+        // an exact null count we need to do a checked subtraction.
+        match count.checked_sub(*stats.null_count.get_value().unwrap_or(&0)) {
+            None => Precision::Inexact(0),
Contributor Author

This made more sense than Precision::Absent.

Also, I'm not sure whether this can happen below as well, i.e. an inexact null count being larger than an exact row count.

Contributor

I agree. I think the use of get_value() on these statistics may also have other bugs: the value it returns may be exact or inexact, but there are some places in the code that treat it as though it were always exact (like here).

I hope to improve statistics in general (see #8227), but other higher-priority things have kept me busy. I think @berkaysynnada was also working on this item for a while; I am not sure whether they have any short-term plans.
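
To make the exact-versus-inexact concern concrete, here is a minimal sketch of a subtraction that keeps track of exactness. The `Precision` enum below is a simplified, non-generic stand-in written for this example, not DataFusion's actual type, and `non_null_rows` is a hypothetical helper that is not part of this PR. The rule it assumes: the result is `Exact` only when both inputs are `Exact` and the subtraction does not underflow; everything else degrades to `Inexact`.

```rust
// Simplified stand-in for illustration only; not DataFusion's real type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the wrapped value, dropping the exact/inexact distinction.
    fn get_value(&self) -> Option<&usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical helper: number of non-null rows, with exactness propagated.
fn non_null_rows(num_rows: Precision, null_count: Precision) -> Precision {
    match (num_rows, null_count) {
        // Without a row count there is nothing to estimate from.
        (Precision::Absent, _) => Precision::Absent,
        // Both inputs exact: the difference is exact unless it would underflow.
        (Precision::Exact(rows), Precision::Exact(nulls)) => match rows.checked_sub(nulls) {
            Some(n) => Precision::Exact(n),
            // Contradictory inputs: report an estimate of 0 instead of underflowing.
            None => Precision::Inexact(0),
        },
        // At least one input is inexact or absent: the result is only an estimate.
        (rows, nulls) => {
            let rows = *rows.get_value().unwrap_or(&0);
            let nulls = *nulls.get_value().unwrap_or(&0);
            Precision::Inexact(rows.saturating_sub(nulls))
        }
    }
}
```

Treating a missing null count as zero mirrors the `unwrap_or(&0)` in the diff; downgrading that case to `Inexact` is a design choice of this sketch, not something the PR does.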

Contributor

This issue has been out of my focus for a while. I can help those who wish to take it on and make progress. Unfortunately, addressing this issue is not in my short-term plans.

+            Some(non_null_count) => Precision::Inexact(non_null_count),
+        }
     }
     Precision::Exact(count) => {
         let count = count - stats.null_count.get_value().unwrap_or(&0);
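
For readers unfamiliar with why the plain `-` was a problem: on `usize`, a subtraction that would go below zero panics in debug builds and wraps around in default release builds, whereas `checked_sub` returns `None`. The sketch below isolates the arithmetic of the patched `Precision::Inexact` arm using plain integers. The function name and signature are hypothetical and chosen for this example only.

```rust
/// Hypothetical helper mirroring the arithmetic of the patched `Inexact` arm:
/// subtract the null count from the row estimate, falling back to 0 when the
/// estimate is smaller than the null count.
fn inexact_non_null_count(row_estimate: usize, null_count: Option<&usize>) -> usize {
    // A missing null count is treated as zero, as in the diff.
    let nulls = *null_count.unwrap_or(&0);
    match row_estimate.checked_sub(nulls) {
        // The estimate was smaller than the null count.
        None => 0,
        Some(non_null_count) => non_null_count,
    }
}
```

For this arm, `row_estimate.saturating_sub(nulls)` would produce the same number. The explicit `match` in the patch keeps the underflow case visible, which is what the review thread goes on to discuss.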