Enable split_file_groups_by_statistics by default #10336
Comments
Example test coverage we should add, I think: #9593 (comment)
I'd like to help with it. 🙌
Thank you @yyy1000 🙏 I think a good place to start would be to write some sqllogic-level tests to cover the important cases. Perhaps for the first test:
I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt
cc @suremarc
One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting …
However, for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded (…
The consequence for this issue is that turning this on by default would regress performance for users that have …
Is your feature request related to a problem or challenge?
Part of #10313
In #9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap
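To make the idea concrete, here is a rough, self-contained sketch of grouping files by non-overlapping sort key ranges. It is not the code from #9593: the `FileRange` struct, the `split_groups_by_range` function, and the file names are all hypothetical, and real statistics would come from file metadata rather than hard-coded integers. The sketch only assumes that each file exposes a min and max value for the sort key.

```rust
/// Hypothetical, simplified stand-in for one file's sort-key statistics.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Greedily pack files into groups so that, within each group, the
/// sort-key ranges are in ascending order and do not overlap.
/// A group built this way can be read back to back without a merge.
fn split_groups_by_range(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Visit files in order of their smallest sort-key value.
    files.sort_by_key(|f| f.min);

    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // First group whose last file ends strictly before this one starts.
        let slot = groups
            .iter_mut()
            .find(|g| g.last().map_or(true, |last| last.max < file.min));
        match slot {
            Some(group) => group.push(file),
            // Every existing group overlaps this file: open a new group.
            None => groups.push(vec![file]),
        }
    }
    groups
}

fn main() {
    let files = vec![
        FileRange { name: "c.parquet", min: 20, max: 29 },
        FileRange { name: "a.parquet", min: 0, max: 9 },
        FileRange { name: "b.parquet", min: 5, max: 14 },
    ];
    for (i, group) in split_groups_by_range(files).iter().enumerate() {
        let names: Vec<&str> = group.iter().map(|f| f.name).collect();
        println!("group {i}: {names:?}");
    }
}
```

In this sketch, a.parquet and c.parquet share a group because their ranges do not overlap, while b.parquet overlaps a.parquet and gets its own group. Note that if no files overlap at all, this greedy packing puts every file into a single group, which would be consistent with the single-threaded scan reported in the comments.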
This feature is behind a feature flag, split_file_groups_by_statistics, which defaults to false, as I think there need to be some more tests in place before we turn it on.
Describe the solution you'd like
Add additional tests and then enable split_file_groups_by_statistics by default.
Describe alternatives you've considered
No response
Additional context
No response