Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply projection to Statistics in FilterExec #13187

Merged
merged 3 commits into from
Nov 3, 2024
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 30, 2024

Which issue does this PR close?

Closes #13186

Rationale for this change

Fix regression introduced in #12281

What changes are included in this PR?

  1. Fix regression
  2. Add tests

Are these changes tested?

Yes, new tests are added

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Oct 30, 2024
self.predicate(),
self.default_selectivity,
)?;
Ok(stats.project(self.projection.as_ref()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug fix

where cpu = 3
) where rn > 0;
----
1970-01-01T00:00:00 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this query errors without the fix

@alamb alamb marked this pull request as ready for review October 30, 2024 17:47
@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2024

FYI @eejbyfeldt

Copy link
Contributor

@eejbyfeldt eejbyfeldt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. LGTM!

I wonder we could and some debug_assert or something to catch bugs of this sort in some general way.

/// For example, if we had statistics for columns `{"a", "b", "c"}`,
/// projecting to `vec![2, 1]` would return statistics for columns `{"c",
/// "b"}`.
pub fn project(mut self, projection: Option<&Vec<usize>>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@Dandandan Dandandan Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, also this implementation seems a bit more efficient than the one in hash join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 90aa0bd

@@ -371,7 +371,12 @@ impl ExecutionPlan for FilterExec {
/// The output statistics of a filtering operation can be estimated if the
/// predicate's selectivity value can be determined for the incoming data.
fn statistics(&self) -> Result<Statistics> {
Self::statistics_helper(&self.input, self.predicate(), self.default_selectivity)
let stats = Self::statistics_helper(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the global stats (total_byte_size) are not correct either, doesn't take into account the reduced number of columns. It should do something similar as stats_projection for ProjectionExec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the statistics calculation should be more sophisticated and I filed #13224 to track the idea

However, I am worried about trying to change how the statistics calculations work in this PR (I outlined some challenges I see in #13224)

Thus, I would like avoid doing a more substantial change in this PR (which fixes a functional bug) and we can sort out how to improve the statistics calculations as a subsequent PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Arc::clone(&self.left),
Arc::clone(&self.right),
self.on.clone(),
&self.join_type,
&self.join_schema,
)?;
// Project statistics if there is a projection
if let Some(projection) = &self.projection {
stats.column_statistics = stats
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code also appears to assume projections never contain repeated values 🤔 I can try and improve the statistics calculation in a future PR to avoid cloning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm yeah, although that assumption probably holds at least in DF codebase (still not good to assume).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR that projects the column statistics without copying as well as handling repetitions: #13225

return self;
};

// todo: it would be nice to avoid cloning column statistics if
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow on PR to improve the performance: #13225

@alamb
Copy link
Contributor Author

alamb commented Nov 3, 2024

Thanks for the fix. LGTM!

I wonder we could and some debug_assert or something to catch bugs of this sort in some general way.

This is an interesting idea @eejbyfeldt . Would the debug assert be implemented on statistics? I suppose we could potentially implement a function that verifies that the output of ExecutionPlan::statistics has the same number of columns as ExecutionPlan::schema() and that the types matched 🤔

@alamb alamb merged commit 85f92ef into apache:main Nov 3, 2024
24 checks passed
@alamb alamb deleted the alamb/fix_stats branch November 3, 2024 12:07
@alamb
Copy link
Contributor Author

alamb commented Nov 3, 2024

Thank you for the reviews @Dandandan and @eejbyfeldt

alamb added a commit to influxdata/arrow-datafusion that referenced this pull request Nov 21, 2024
* Apply projection to `Statistics` in `FilterExec`

* Use Statistics::project in HashJoin
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: "Internal error: Only intervals with the same data type are comparable, lhs:Float32, rhs:UInt64."
4 participants