-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ColumnStatistics::Sum
#14074
base: main
Are you sure you want to change the base?
Add ColumnStatistics::Sum
#14074
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @gatesn -- I think this is a nice addition.
It looks like the cargo fmt
test is failing
Ideally we would add unit test coverage for Precision::multiply
Precision::sub
and Precision::cast_to
before we merged.
Thanks again -- excited to see this working
FYI @suremarc @berkaysynnada / @ozankabak as this changes statistics and I think you are already working on things related to that:
@@ -436,6 +492,8 @@ pub struct ColumnStatistics { | |||
pub max_value: Precision<ScalarValue>, | |||
/// Minimum value of column | |||
pub min_value: Precision<ScalarValue>, | |||
/// Sum value of a column | |||
pub sum_value: Precision<ScalarValue>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I think we mentioned in #13736 my only real concern with this addition is that it will make ColumnStatistics
even bigger (each ScalarValue is quite large already and ColumnStatistics
are copied a bunch
However, I think the "right" fix for that is to move to using a different statistics representation (e.g. Arc::ColumnStatistics
) so I don't see this as a blocker
(_, _) => Precision::Absent, | ||
} | ||
} | ||
|
||
/// Casts the value to the given data type, propagating exactness information. | ||
pub fn cast_to(&self, data_type: &DataType) -> Result<Precision<ScalarValue>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb one question I have is whether this should return a Result, or we should assume that a failed cast implies overflow and therefore return Precision::Absent
?
The caller (currently in cross-join) unwraps to Absent
, I just didn't know whether to internalize that here.
Edit: I decided it was better to propagate the error and allow the caller to decide. It was more useful in a couple of places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some tests and hopefully appeases the linter!
We've started to refactor. The design is complete, and the implementation is in progress. I’ve taken a look at this and have some questions. For example, are we planning to add many types of functions to statistics, or is there a defined list of statistics that can be inferred from the sources or have meaningful applications in optimizer rules? If we agree that these kinds of extensions to column statistics are indeed useful and obtainable, then we can proceed with merging this. We would also ensure it is included in the new setup. FYI @ozankabak |
Which issue does this PR close?
This PR adds a sum statistic to DataFusion.
Future use will include optimizing aggregation functions (sum, avg, count), see https://github.com/apache/datafusion/pull/13736/files for examples.
Are there any user-facing changes?
The ColumnStatistics struct has an extra public sum_value field.