Apply projection to Statistics in FilterExec #13187
    self.predicate(),
    self.default_selectivity,
)?;
Ok(stats.project(self.projection.as_ref()))
This is the bug fix
  where cpu = 3
) where rn > 0;
----
1970-01-01T00:00:00 1
this query errors without the fix
16c9bcd to 86690fd
FYI @eejbyfeldt
Thanks for the fix. LGTM!
I wonder if we could add some debug_assert or something similar to catch bugs of this sort in a general way.
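A check of the kind suggested here might, as a minimal sketch, compare the number of column statistics an operator reports against the number of fields in its output schema. The helper names below (`stats_match_schema`, `checked_stats`) are hypothetical and are not part of DataFusion; the sketch uses plain counts rather than the real `Schema` and `Statistics` types.

```rust
/// Hypothetical helper: true when the reported column statistics line up
/// one-to-one with the fields of the output schema.
fn stats_match_schema(num_schema_fields: usize, num_column_stats: usize) -> bool {
    num_schema_fields == num_column_stats
}

/// Hypothetical wrapper an operator could call before returning statistics.
/// The check only runs in debug builds, so release builds pay nothing.
fn checked_stats(num_schema_fields: usize, column_stats: Vec<u64>) -> Vec<u64> {
    debug_assert!(
        stats_match_schema(num_schema_fields, column_stats.len()),
        "statistics describe {} columns but the schema has {} fields",
        column_stats.len(),
        num_schema_fields
    );
    column_stats
}

fn main() {
    // A projected operator with two output fields and two column statistics.
    let stats = checked_stats(2, vec![10, 20]);
    assert_eq!(stats.len(), 2);
}
```

A mismatch like the one fixed in this PR (statistics for the unprojected input columns returned alongside a projected schema) would trip such an assertion in debug builds.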
/// For example, if we had statistics for columns `{"a", "b", "c"}`,
/// projecting to `vec![2, 1]` would return statistics for columns `{"c",
/// "b"}`.
pub fn project(mut self, projection: Option<&Vec<usize>>) -> Self {
Should this method also be used when we project the statistics in hash_join?
Makes sense, also this implementation seems a bit more efficient than the one in hash join.
Done in 90aa0bd
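The behaviour described in the doc comment for `project` can be sketched on simplified stand-in types. `ColStats` and `Stats` below are hypothetical and much smaller than DataFusion's real `ColumnStatistics` and `Statistics`; only the `project` signature is taken from the diff in this thread.

```rust
/// Simplified stand-in for per-column statistics (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    name: &'static str,
    null_count: Option<usize>,
}

/// Simplified stand-in for plan-level statistics (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Option<usize>,
    column_statistics: Vec<ColStats>,
}

impl Stats {
    /// Keep only the columns named by `projection`, in projection order.
    /// `None` means "no projection" and returns the statistics unchanged.
    fn project(mut self, projection: Option<&Vec<usize>>) -> Self {
        let Some(projection) = projection else {
            return self;
        };
        // Cloning keeps the sketch simple; it panics on out-of-range indices.
        self.column_statistics = projection
            .iter()
            .map(|&i| self.column_statistics[i].clone())
            .collect();
        self
    }
}

/// Build statistics for three columns named "a", "b" and "c".
fn sample_stats() -> Stats {
    Stats {
        num_rows: Some(10),
        column_statistics: ["a", "b", "c"]
            .into_iter()
            .map(|name| ColStats { name, null_count: Some(0) })
            .collect(),
    }
}

fn main() {
    let projected = sample_stats().project(Some(&vec![2, 1]));
    let names: Vec<&str> = projected.column_statistics.iter().map(|c| c.name).collect();
    assert_eq!(names, ["c", "b"]);
}
```

Note that only the per-column entries change in this sketch; `num_rows` is left untouched because a projection does not remove rows.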
@@ -371,7 +371,12 @@ impl ExecutionPlan for FilterExec {
     /// The output statistics of a filtering operation can be estimated if the
     /// predicate's selectivity value can be determined for the incoming data.
     fn statistics(&self) -> Result<Statistics> {
-        Self::statistics_helper(&self.input, self.predicate(), self.default_selectivity)
+        let stats = Self::statistics_helper(
I think the global stats (total_byte_size) are not correct either: they don't take into account the reduced number of columns. This should do something similar to stats_projection for ProjectionExec.
I agree the statistics calculation should be more sophisticated, and I filed #13224 to track the idea.
However, I am worried about trying to change how the statistics calculations work in this PR (I outlined some challenges I see in #13224).
Thus, I would like to avoid making a more substantial change in this PR (which fixes a functional bug); we can sort out how to improve the statistics calculations in a subsequent PR.
Makes sense
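To illustrate the `total_byte_size` concern raised in this thread, here is one possible way a byte-size estimate could account for dropped columns. It assumes each column has a known fixed width in bytes. The function `projected_byte_size` and its parameters are hypothetical, and this is not necessarily how `stats_projection` or the follow-up tracked in #13224 approaches it.

```rust
/// Hypothetical estimate of the output size in bytes after a projection.
/// `column_widths[i]` is the fixed width of input column `i`, or `None`
/// when the width is unknown (for example a variable-length string column).
fn projected_byte_size(
    num_rows: Option<usize>,
    column_widths: &[Option<usize>],
    projection: &[usize],
) -> Option<usize> {
    let mut row_width = 0usize;
    for &i in projection {
        // An out-of-range index or an unknown width makes the estimate unknown.
        let width = (*column_widths.get(i)?)?;
        row_width = row_width.checked_add(width)?;
    }
    // An unknown row count or an overflow also yields an unknown estimate.
    num_rows?.checked_mul(row_width)
}

fn main() {
    // Three input columns: 8 bytes, unknown width, 4 bytes.
    let widths = [Some(8), None, Some(4)];
    // Keeping columns 0 and 2 gives 12 bytes per row.
    assert_eq!(projected_byte_size(Some(100), &widths, &[0, 2]), Some(1200));
    // Keeping the unknown-width column makes the estimate unknown.
    assert_eq!(projected_byte_size(Some(100), &widths, &[1]), None);
}
```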
    Arc::clone(&self.left),
    Arc::clone(&self.right),
    self.on.clone(),
    &self.join_type,
    &self.join_schema,
)?;
// Project statistics if there is a projection
if let Some(projection) = &self.projection {
    stats.column_statistics = stats
this code also appears to assume projections never contain repeated values 🤔 I can try and improve the statistics calculation in a future PR to avoid cloning
Hm yeah, although that assumption probably holds at least in DF codebase (still not good to assume).
I made a PR that projects the column statistics without copying as well as handling repetitions: #13225
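As a rough sketch of the idea discussed in this thread (projecting without cloning every entry while still tolerating repeated indices), one option is to move each value out on its last use and clone only for earlier uses of a repeated index. The function `project_moving` is hypothetical and is not necessarily how #13225 implements it.

```rust
/// Hypothetical projection over a vector of per-column values.
/// Each value is moved out on its last use, so an index that appears
/// once is never cloned; only repeated indices trigger a clone.
/// Panics if `projection` contains an out-of-range index.
fn project_moving<T: Clone>(items: Vec<T>, projection: &[usize]) -> Vec<T> {
    // Count how many times each input index is still needed.
    let mut remaining = vec![0usize; items.len()];
    for &i in projection {
        remaining[i] += 1;
    }
    // Wrap values in `Option` so they can be taken out individually.
    let mut slots: Vec<Option<T>> = items.into_iter().map(Some).collect();
    projection
        .iter()
        .map(|&i| {
            remaining[i] -= 1;
            if remaining[i] == 0 {
                // Last use of this index: move the value out.
                slots[i].take().expect("value present until its last use")
            } else {
                // The index appears again later: clone and keep the original.
                slots[i].clone().expect("value present until its last use")
            }
        })
        .collect()
}

fn main() {
    let cols = vec!["a".to_string(), "b".to_string(), "c".to_string()];
    // Index 2 is repeated, so "c" is cloned once; "b" is moved.
    let projected = project_moving(cols, &[2, 1, 2]);
    assert_eq!(projected, ["c", "b", "c"]);
}
```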
    return self;
};

// todo: it would be nice to avoid cloning column statistics if
follow on PR to improve the performance: #13225
This is an interesting idea @eejbyfeldt. Would the debug assert be implemented on statistics? I suppose we could potentially implement a function that verifies that the output of
Thank you for the reviews @Dandandan and @eejbyfeldt
* Apply projection to `Statistics` in `FilterExec`
* Use Statistics::project in HashJoin
Which issue does this PR close?
Closes #13186
Rationale for this change
Fix regression introduced in #12281
What changes are included in this PR?
Are these changes tested?
Yes, new tests are added
Are there any user-facing changes?