Support deep schema pruning and projection #11745

adragomir · 2024-07-31T11:41:20Z

Is your feature request related to a problem or challenge?

At the moment, DataFusion supports top-level column pruning: we have a mechanism, `projection: [usize]`, where we detect the set of top-level columns a query needs from a schema. The columns are inferred from the input and passed through all the layers (logical -> optimize -> physical). Some implementations can also take advantage of this to minimize the data read from storage at the lowest level.

However, for deeply nested schemas (a small number of huge, deeply nested top-level columns: lists of structs with maps, etc.), this optimization is not very useful, because the top-level columns themselves are very large.
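
To make the limitation concrete, here is a minimal, language-neutral sketch in Python. It does not use the DataFusion or Arrow API: the toy schema encoding (a list of `(name, type)` pairs, with dicts for structs and one-element lists for list types), the column names, and the helper `project_top_level` are all assumptions made up for illustration.

```python
# Toy schema, NOT the Arrow/DataFusion API: a list of (name, type) pairs.
# Structs are dicts, list types are one-element Python lists, leaves are strings.
toy_schema = [
    ("id", "int64"),
    ("events", [{"ts": "timestamp", "payload": {"kind": "utf8", "body": "utf8"}}]),
]


def project_top_level(schema, projection):
    """Keep only the top-level columns whose indices appear in `projection`."""
    return [schema[i] for i in projection]


# A query that only needs events.payload.kind can still only ask for column 1,
# so the whole nested `events` column (ts, payload.body, ...) comes along.
top_level_only = project_top_level(toy_schema, [1])
```

Index-based projection can drop `id`, but it has no way to say "only `kind` inside `events`".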

Describe the solution you'd like

We should have a way to represent, and push through all the layers, the "deep" projection of the actual leaves that the query needs.
The schema and data returned after applying deep schema pruning should reflect the change (if we select one field from a struct in a list, we should get a list of structs with a single field, etc.).
The feature should apply only to physical layouts that actually support it (for example, Parquet and Arrow).
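
As a rough illustration of the schema side of this request, the sketch below prunes a nested schema down to a set of leaf paths. It is plain Python, not the DataFusion or Arrow API, and it is not the design proposed in the linked PR: the toy type encoding (dicts for structs, one-element lists for list types, strings for leaves), the path format (tuples of struct field names), and the function `prune` are all hypothetical.

```python
def prune(dtype, paths):
    """Return `dtype` reduced to the leaves named by `paths`.

    Hypothetical encoding: a struct is a dict of field name -> type, a list
    type is a one-element Python list holding the element type, and a leaf is
    a string. Each path is a tuple of struct field names; an empty tuple means
    "keep everything from here down". Lists are transparent to paths.
    """
    if any(len(p) == 0 for p in paths):
        return dtype
    if isinstance(dtype, list):
        # Step through the list and prune its element type.
        return [prune(dtype[0], paths)]
    if isinstance(dtype, dict):
        unknown = {p[0] for p in paths} - dtype.keys()
        if unknown:
            raise KeyError(f"unknown field(s): {sorted(unknown)}")
        kept = {}
        for name, child in dtype.items():
            sub_paths = [p[1:] for p in paths if p[0] == name]
            if sub_paths:
                kept[name] = prune(child, sub_paths)
        return kept
    raise KeyError(f"cannot descend into leaf type {dtype!r}")


nested_schema = {
    "id": "int64",
    "events": [{"ts": "timestamp", "payload": {"kind": "utf8", "body": "utf8"}}],
}

# Selecting one field from a struct inside a list yields a list of structs
# with only that field.
pruned = prune(nested_schema, [("events", "payload", "kind")])
# pruned == {"events": [{"payload": {"kind": "utf8"}}]}
```

A real implementation would also have to apply the same pruning to the data and carry the leaf paths through the logical, optimizer, and physical layers; this sketch only covers the resulting schema shape.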

Describe alternatives you've considered

No response

Additional context

No response


alamb · 2024-07-31T15:31:32Z

Possibly related to #2581

I think there may be some useful work that was added to arrow-rs apache/arrow-rs#5148

cc @goldmedal who has been working on something similar

alamb · 2024-07-31T15:31:45Z

Also, maybe @jayzhan211

jayzhan211 · 2024-08-03T02:07:32Z

I think it is worth figuring out an efficient way to deal with nested schemas on the Arrow side (#2581). Then it will be clear how we could leverage it. We might not need the scan_deep hash map in the end 🤔

adragomir added the enhancement New feature or request label Jul 31, 2024

adragomir mentioned this issue Jul 31, 2024: [WIP] Support deep schema pruning and projection #11747 (Closed)
