Optimized version of SortPreservingMerge that doesn't actually compare sort keys if the key ranges are ordered
#10316
Comments
Here is the entire implementation in case anyone wants this: https://github.com/influxdata/influxdb3_core/blob/b546e7f86ee9adbff0dd3c5e687140848397604a/iox_query/src/provider/progressive_eval.rs |
Oh yeah, we're going to be hitting this all the time. Definitely want to see this! 😁 |
Also to be clear to everyone else -- we have this code in InfluxDB already but it would be great to have other people be able to use it (and help maintain it) |
Thanks @alamb. I am happy to port |
I think we should both port ProgressiveEval and hook it up in the optimizer so it is actually used (likely based on the analysis in #9593) |
If the hooking work is not that tricky (which I think is the case), I am happy to do that, too |
@alamb Just to be absolutely clear, if the plan consists entirely of Parquet files from a single table, then the That said, it's worth pointing out that |
What comes to my mind is that if we can successfully bring the |
I think this approach would work as well -- or maybe just a flag on the Per partition statistics is an interesting idea 🤔 -- it certainly makes sense for things like data sources. I wonder how generally useful it could be |
This is the ticket to port |
Update here:
So in my mind, what is required to move on with this ticket is:
For the "do key ranges overlap" detection code I think we can use what @suremarc added in #9593 |
I think the status is still as in #10316 (comment)
I believe @suremarc said he may have some ideas of how to add the "do key ranges overlap" code / tests.
@wj-stack perhaps you would be willing to write some tests (that should be able to use SortPreservingMerge) to help the project along? You should be able to make a sqllogictest -- see docs here https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#readme
Here is an example of creating an existing table with order:
Here are examples of creating parquet files as part of tests:
|
Hey @alamb, I am interested in using

    SELECT * FROM t1
    WHERE "timestamp" < cutoff
    UNION ALL
    SELECT * FROM t2
    WHERE "timestamp" >= cutoff
    ORDER BY "timestamp"

In principle it should be possible to read results from I do think we can reuse the analysis in #9593, but ideally we would have statistics per partition as @ozankabak mentioned. This would allow us to implement the operator generically, without really having to inspect its children. |
I went ahead and made an attempt at implementing this over in this draft PR: #13296 I was able to reuse the I have no nontrivial way to test this code until we get statistics per partition. In the PR I proposed the following API:

    pub trait ExecutionPlan: [...] {
        // [...]
        fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
            // Return global statistics by default
            Ok(vec![
                self.statistics()?;
                self.properties().partitioning.partition_count()
            ])
        }
    }

In order for the statistics to be useful we'll actually need non-default implementations of course. So I'm wondering if I should just implement this method just for As discussed in Epic: Statistics Improvements we will need #8078 in order for this code to actually work properly in all situations, but I believe @alamb is working on that. |
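To illustrate why the default implementation alone is not enough, here is a minimal sketch using toy types. `SortKeyRange`, `Plan`, `Files`, `Opaque`, and `non_overlapping` are all hypothetical names invented for this example and are not part of DataFusion. A node that falls back to repeating its global range for every partition always looks like it has overlapping partitions, while a node that reports real per-partition ranges can prove they are disjoint.

```rust
/// Toy stand-in for statistics: only the (min, max) of the sort key.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SortKeyRange {
    pub min: i64,
    pub max: i64,
}

/// Toy stand-in for a plan node (hypothetical, not the DataFusion trait).
pub trait Plan {
    fn partition_count(&self) -> usize;
    /// Range of the sort key across all partitions.
    fn range(&self) -> SortKeyRange;
    /// Default: repeat the global range once per partition.
    fn range_by_partition(&self) -> Vec<SortKeyRange> {
        vec![self.range(); self.partition_count()]
    }
}

/// A source that knows the range of each file (one file per partition).
pub struct Files(pub Vec<SortKeyRange>);

impl Plan for Files {
    fn partition_count(&self) -> usize {
        self.0.len()
    }
    fn range(&self) -> SortKeyRange {
        SortKeyRange {
            min: self.0.iter().map(|r| r.min).min().unwrap_or(0),
            max: self.0.iter().map(|r| r.max).max().unwrap_or(0),
        }
    }
    /// Non-default implementation: report the real per-file ranges.
    fn range_by_partition(&self) -> Vec<SortKeyRange> {
        self.0.clone()
    }
}

/// The same data behind a node that only uses the default implementation.
pub struct Opaque(pub Files);

impl Plan for Opaque {
    fn partition_count(&self) -> usize {
        self.0.partition_count()
    }
    fn range(&self) -> SortKeyRange {
        self.0.range()
    }
}

/// True if, once sorted by `min`, every range ends before the next begins.
pub fn non_overlapping(ranges: &[SortKeyRange]) -> bool {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| r.min);
    sorted.windows(2).all(|w| w[0].max < w[1].min)
}
```

With two files covering 1..=31 and 32..=59, `Files` reports disjoint partitions, whereas `Opaque` reports the range 1..=59 twice and so cannot skip the merge.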
It might be a good idea to plumb it in to see how easy/hard it is. I am trying to think of a way to avoid the per-partition statistics but I am currently drawing a blank |
Is your feature request related to a problem or challenge?
When merging a large number of pre-sorted streams (e.g. in our case, a large number of pre-sorted parquet files) the actual work in SortPreservingMerge to keep them sorted is often substantial (as the sort key of each row in each stream must be compared to the other potential candidates).
Here is the sort preserving merge:
datafusion/datafusion/physical-plan/src/sorts/sort_preserving_merge.rs
Lines 39 to 67 in 4edbdd7
However, in some cases (such as @suremarc has identified in #6672) we can use information about how the values of the sort key columns are distributed to avoid needing a sort.
For example, if we have three files that are each sorted by time and have the following ranges:
- file1.parquet: min(time) = 2024-01-01 and max(time) = 2024-01-31
- file2.parquet: min(time) = 2024-02-01 and max(time) = 2024-02-28
- file3.parquet: min(time) = 2024-03-01 and max(time) = 2024-03-31

We can produce the sorted output stream by first reading file1.parquet entirely, then file2.parquet, then file3.parquet.
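The check behind this can be sketched as follows. This is a minimal illustration, not the ProgressiveEval implementation: `KeyRange` and `concat_order` are hypothetical names, and the sort key is simplified to a single `i64` (for example, a day number). The inputs are sorted by their minimum value, and concatenation is allowed only if each input ends before the next one begins.

```rust
/// Hypothetical summary of one pre-sorted input (e.g. one Parquet file).
/// `min` and `max` are the smallest and largest sort-key values it holds.
#[derive(Debug, Clone, PartialEq)]
pub struct KeyRange {
    pub name: String,
    pub min: i64,
    pub max: i64,
}

/// Returns the inputs in the order they can be concatenated to produce
/// sorted output, or `None` if any two ranges overlap and a real merge
/// is still required.
pub fn concat_order(mut ranges: Vec<KeyRange>) -> Option<Vec<KeyRange>> {
    // Order candidates by their smallest key.
    ranges.sort_by_key(|r| r.min);
    // Each input must end strictly before the next one starts. The strict
    // comparison is a conservative choice made for this sketch.
    let disjoint = ranges.windows(2).all(|w| w[0].max < w[1].min);
    disjoint.then_some(ranges)
}
```

When `concat_order` returns `Some`, the inputs can be read one after another in the returned order; when it returns `None`, the plan would fall back to a regular merge.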
Not only will this be faster than using SortPreservingMerge, it will also require less intermediate memory, as we don't need to read a batch from each input stream to begin producing output. For cases where there may be 100s of files, this can substantially reduce the number of concurrently outstanding requests.
Also, for a query that will not read the entire dataset (e.g. only wants the most recent values) it can be especially beneficial:
In this case our example above would only ever read file1.parquet (wouldn't even open the others) if it had more than 10 rows.
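The early-termination behavior can be sketched like this, assuming the inputs are already known to be non-overlapping and are supplied in concatenation order. `progressive_take` is a hypothetical helper written for this illustration; rows are simplified to `i64` values, and each input is represented by a closure so that it is only "opened" when actually needed.

```rust
/// Reads up to `limit` rows from inputs given in concatenation order,
/// opening each input only after the previous one is exhausted.
/// Returns the rows read and the number of inputs that were opened.
pub fn progressive_take<I, F>(openers: Vec<F>, limit: usize) -> (Vec<i64>, usize)
where
    I: Iterator<Item = i64>,
    F: FnOnce() -> I,
{
    let mut out = Vec::new();
    let mut opened = 0;
    for open in openers {
        if out.len() >= limit {
            break; // limit reached: the remaining inputs are never opened
        }
        opened += 1;
        // Take only as many rows as are still needed from this input.
        out.extend(open().take(limit - out.len()));
    }
    (out, opened)
}
```

With three inputs of 20 rows each and a limit of 10, only the first input is opened. A merge-based approach would instead need at least one batch from every input before emitting its first row.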
Describe the solution you'd like
I would like an operator that does not actually merge if not needed
Describe alternatives you've considered
@NGA-TRAN implemented the ProgressiveEval operator in InfluxDB IOx, which we have found works pretty well, and has offered to contribute it back upstream.
We wrote about using this operator here: https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/
Additional context
The original inspiration for this operator came from @pauldix (who I think mentioned it was inspired by ElasticSearch)