-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Add Precision:AtLeast
and Precision::AtMost
for more Statistics
… precision
#13293
Conversation
/// The value is know to be at most (inclusive) of this value. | ||
/// | ||
/// The actual value may be smaller, but it is never greater. | ||
AtMost(T), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main change -- adding these variants
Without knowing too much about the use case for inexact statistics, is it possible we may need both inexact and "precise" upper/lower bounds for column statistics? I.e. a tight, inexact lower/upper bound, and then a looser "real" upper & lower bound . I can see this causing tension between parts of the codebase that benefit from tighter but inexact bounds and parts that benefit from having correct bounds. |
I am also not super sure about the usecase for inexact statistics. I think there was some idea that knowing a value was likely close to 1M would be more helpful than simply discarding the values. However, almost all the operations I can think of (filtering, limit, aggregation) don't make the output range larger than the input. Maybe could consider simply removing Precision {
Exact,
AtMost,
AtLeast,
Unknown
} I still do feel like having I wonder if @berkaysynnada has any thoughts or insights? |
|
I wonder if a range semantics would be nicer, i.e.: struct Precision {
lower: Option<T>,
upper: Option<T>,
} And then you have:
If you really want to support inexact estimations, you could extend it like this: struct Precision {
lower: Bound<T>,
upper: Bound<T>,
}
enum Bound {
Absent,
Inexact(T),
Exact(T),
} |
That is a good idea. One think is that we already have a version of that type of Interval logic (that handles open/closed upper/lower bounds, etc_ here: https://docs.rs/datafusion-expr-common/42.2.0/src/datafusion_expr_common/interval_arithmetic.rs.html#160-163 https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html However, that is hard coded to use ScalarValue where But maybe we can just provide a conversion to/from |
With single estimated value the conceptual idea is that the random variable we're estimating has distribution "condensed" around that value. One can imagine this being normal distribution with the value being the mean. Obviously this is over-simplification. Not everything has normal distribution. For example non-negative numbers. But this mental model is easy to work with. While exact ranges are super useful (for example for predicate derivation and pruning), inexact ranges as statistics model pose a problem how to interpret the value, when e.g. judging which side of the join is smaller, or when computing filter on top of some other computation. It's tempting to capture uncertainty as ever-widening ranges and to finally interpret the range as its middle value. This is definitely more complex mental model and it will suerly result in the code being more complex. Will it also result in better estimates? Maybe. There are also other alternatives so consider
|
Perhaps it makes sense to somehow separate estimated statistics vs bounds (e.g. compute / return both). In some situations you would like to know a bound (avoiding merge), while in other situations (e.g. join order) you would like to have a point estimate. |
Absolutely, the exact bounds must not be mistaken with estimates/statistics and need to be treated as two different things, even though they look similar. |
Based on the discussion, another idea: /// ...
///
/// # Note
/// Unknown ranges are modeled as `Precision::Range{lower: None, upper: None}`,
enum Precision {
/// Estimated, but very close to the given point.
///
/// This can be treated as an over-simplified normal distribution.
PointEstimation(T),
/// Value is for sure in the given open/half-open/closed range.
///
/// If `lower`/`upper` are given, they are INCLUSIVE ends.
Range {
lower: Option<T>,
upper: Option<T>,
},
} (this might need some more docs and helper methods, but I think you get the idea) |
I prefer going with
as mentioned there. We will be both making us of interval arithmetics, and eliminating the need for separate bound and estimation statistics. It does not necessitate to select which kind of stats (range or estimate) you keep. The only challenge is we need to guard the internal values, and provide the all API's someone can require. It should be never in an inconsistent state (like estimate value is out of the bounds). |
Why does |
If only bounds are given, the value could be its mean value maybe? Assuming a uniform dist should not harm |
I don't think we get a mean value from parquet for example. So that would be a rather opinionated assumption. Also note that this is somewhat hard or even impossible to calculate for some types (e.g. strings) |
Then do we need 4 states? -- Both bounds and estimation, only bounds, only estimation, and neither one |
yeah, if you want to have bounds AND a point estimator, then you need a larger state space, something like: struct Precision {
/// Actual values are very close to the given point.
///
/// This can be treated as an over-simplified normal distribution.
point_estimation: Option<T>,
/// Lower bound for a open/half-open/closed range.
///
/// If given, the bound is INCLUSIVE. The bounds may be
/// overestimated (i.e. the actual lower value may be larger)
/// but if provided, all values are included in this range.
lower: Option<T>,
/// Upper bound for a open/half-open/closed range.
///
/// If given, the bound is INCLUSIVE. The bounds may be
/// overestimated (i.e. the actual upper value may be smaller)
/// but if provided, all values are included in this range.
upper: Option<T>,
} |
It's been a month and I haven't seen any new proposals. IIUC the main use case for inexact statistics is to estimate Unless I'm misunderstanding, it seems like @crepererum's proposed API accommodates both of these use cases. Another open question is if we should try to unify I am interested in getting this change in so that I can resume work on #13296, so I am going to start pre-emptively working on a PR with the new |
@suremarc if you are going to work on Statistics, here are some properties I think would be most useful:
It would be really great to consolidate the statistics aggregation code (e.g. that combines statistics across files) into a single struct / location (but that is a good follow on perhaps) |
I wrote up a unified proposal here:
Feedback very welcome |
This is superceded by the current work from @ozankabak and Synnada:
So closing it for now |
Which issue does this PR close?
SortPreservingMerge
that doesn't actually compare sort keys of the key ranges are ordered #10316Discussion:
This is a Request for Comment (and maybe also a POC)
I hacked very briefly on a different approach (
Precision::Interval
) but found:ColumnStatistics
already has amin
andmax
(basically a bound) so introducing an interval intoPrecision
would likely mean we would need to changeColumnStatistics
to have anvalue: Precision<..>
as well -- which might be a better choice, but it would be a bigger changeInterval
is defined in a different crate so we can't easily use it inStatistics
s without a bunch of codeRationale for this change
For the analysis described on #10316, we need to know if a value is constrained to a range to avoid merging. However the current
Statistics
are eitherExact
orInexact
, so once the precision becomes Inexact we lose the information that the possible minimum / maximum values do not change.This came up twice for me recently:
SortPreservingMerge
that doesn't actually compare sort keys of the key ranges are ordered #10316)I hacked around it downstream, but I think this is all sounding like it is time to add a new
Precision
enum that allows for this usecaseWhat changes are included in this PR?
Precision::AtMost
,Precision::AtLeast
ColumnStatistics::min
andColumnStatistics::max
that return the correct valueAre these changes tested?
Are there any user-facing changes?