Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable split_file_groups_by_statistics by default #10336

Open
Tracked by #10313
alamb opened this issue May 1, 2024 · 5 comments
Open
Tracked by #10313

Enable split_file_groups_by_statistics by default #10336

alamb opened this issue May 1, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented May 1, 2024

Is your feature request related to a problem or challenge?

Part of #10313

In #9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap

This feature is behind a feature flag, split_file_groups_by_statistics which defaults to false as I think there needs to be some more tests in place before we turn it on

Describe the solution you'd like

Add additional tests and then enable split_file_groups_by_statistics by default

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label May 1, 2024
@alamb
Copy link
Contributor Author

alamb commented May 1, 2024

Example test coverage we should add I think: #9593 (comment)

@yyy1000
Copy link
Contributor

yyy1000 commented May 4, 2024

I'd like to help it. 🙌

@alamb
Copy link
Contributor Author

alamb commented May 4, 2024

THank you @yyy1000 🙏

I think a good place to start would be to write some sqllogic level tests to cover the important cases

Perhaos for the first test:

  1. Create files: file1.parquet, file2.parquet both sorted on a but file 1 has the columns in the order a, b, c and file has the columns in the order c, b, a. The keyranges of values of a should be non overlapping
  2. Create an external table a, b, c with explicit order by a, and then query SELECT ... ORDER BY a and make sure the output plan doesn't use sort preserving merge

I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

cc @suremarc

@leoyvens
Copy link
Contributor

One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting execution.collect_statistics = true, which makes sense.

However for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded (ParquetScan with a single file group), which is a big performance regression. So it didn't really help me, maybe I actually want #10316.

The consequence to this issue being that turning this on by default would regress performance for users that have execution.collect_statistics = true. Maybe the flag should be merged with prefer_existing_sort, which has the semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just wait for #10316, so we can both avoid the sort and still have a parallel ParquetExec.

@alamb
Copy link
Contributor Author

alamb commented Jul 29, 2024

Sorry for the delay @leoyvens and thank you for this analysis

#11170

I would personally love to take this approach

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants