Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add possibility to manually perform the column projection #565

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pfackeldey
Copy link
Collaborator

@pfackeldey pfackeldey commented Jan 14, 2025

This PR adds a new function (dak.manual.optimize_columns) that allows to perform the column projection by hand given a set of columns (or keys) in the same form that dak.inspect.report_necessary_columns returns. (Closes #559)

Why is this useful?

  • This PR allows to run dak.inspect.report_necessary_columns once for one dataset and then reuse its output to manually project these columns for other datasets.
  • This PR allows the user to manipulate the set of columns, see the test tests/test_manual.py as an example.

Note

Currently, dak.manual.optimize_columns will disable the possibility to project theAwkwardInputLayers again. This is because otherwise there might be unexpected results when running the manual optimization and then the automatic one that's enabled by default.

Looking forward to your feedback!

PS: If this gets accepted, I'll open another PR in uproot that allows to perform the manual column optimization aswell.

@codecov-commenter
Copy link

codecov-commenter commented Jan 14, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.51%. Comparing base (8cb8994) to head (d04dc80).
Report is 206 commits behind head on main.

Files with missing lines Patch % Lines
src/dask_awkward/manual/column_optimization.py 90.00% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #565      +/-   ##
==========================================
- Coverage   93.06%   92.51%   -0.56%     
==========================================
  Files          23       24       +1     
  Lines        3290     3458     +168     
==========================================
+ Hits         3062     3199     +137     
- Misses        228      259      +31     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pfackeldey
Copy link
Collaborator Author

The Conda tests seems to fail unrelated to this PR right now.

@martindurant
Copy link
Collaborator

The Conda tests seems to fail unrelated to this PR right now.

mambaforge is deprecated, it should be reverted back to miniconda in the GHA yaml definition

Copy link
Collaborator

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a first quick scan, it is indeed going to be sueful and not too complex to extract the column selection/optimisation as you have envisaged.

I only have a couple of comments so far, but one bigger question: if you are familiar with the one-pass PR, is it obvious how to port what is here to that branch? It should not be hard, since necessary_columns still exists, and you're not actually calling any of the opt machinery here, but I think it's worth checking - in the long run, I still think that one-pass is what we should aim for.

io_func_implements_projection(self.io_func)
and not self.has_been_unpickled
)
assert isinstance(self._is_projectable, bool)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These ind of type-checks should really not be necessary. In fact, assertions should not normally appear in non-test code at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes 👍 I'll remove them

def project_manually(self, columns: frozenset[str]) -> AwkwardInputLayer:
"""Project the necessary _columns_ to the AwkwardInputLayer."""
assert self.is_projectable
io_func = cast(ImplementsProjection, self.io_func).project_manually(columns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I am not a fan of all the cruft is takes to make mypy happy - I'd rather do with simpler or missing types.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the same implementations as we have here for project and necessary_columns.

Copy link
Collaborator Author

@pfackeldey pfackeldey Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove it, but unfortunately it is not that easy to appease mypy otherwise (except for putting # type: ignore at multiple places - but what's the point of type checking then?).

The reason for the current implementation is that we support a protocol that all projectable IO layers need to adhere to (i.e. ImplementsProjection). By making sure that self.is_projectable is true, we know that our IO layer has the methods of the ImplementsProjection protocol available. Unfortunately, mypy seems to not recognize this correctly, because self.io_func may be a more general implementation that only needs to implement the ImplementsIOFunction protocol. ImplementsProjection is a specification of that, and we can safely "cast" it here after making sure that self.is_projectable is True.

It's not something that I invented/added new here. I'm following the existing protocols for the IO functions.

(The same argument goes for the comment below)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not something that I invented/added new here. I'm following the existing protocols for the IO functions.

Agreed. It is mostly removed in the one-pass PR, and I would say nothing useful got lost :)

@@ -44,6 +44,8 @@ def attrs(self) -> dict | None: ...

def project_columns(self: T, columns: frozenset[str]) -> T: ...

def project_manually(self: T, columns: frozenset[str]) -> ImplementsIOFunction: ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fully specified in the implementation of ColumnProjectionMixin, right? So does the protocol definition here serve any purpose?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be able to remove it without mypy complaining, I'll check. Apart from that, I was adding it here as other ones were defined for this protocol aswell, e.g. project_columns.

from dask_awkward.lib.core import Array


def optimize_columns(array: Array, columns: dict[str, frozenset[str]]) -> Array:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to think of way to ensure that the resulting output is not optimised again at compute time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I set .is_projectable on the layers that are optimized "by hand" to False. However, I'm not sure if that's the most elegant solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, OK, I can see that works. We should see what it does with multiple input layers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 It should work as follows when calling .compute(optimize_graph=True): any layer that has been manually projected won't be optimized again (they are basically treated as a any other non-column-optimizable layer), whereas any other AwkwardInputLayer goes through the usual automatic optimization process.


assert u4_and_u8_array.fields == ["u4", "u8"]

materialized_u4_and_u8_array = u4_and_u8_array.compute()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With or without optimise?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would result in the same output, because currently dak.manual.optimize_columns sets the is_projectable property of AwkwardInputLayers to False, which disables any other column optimization afterwards.

@pfackeldey
Copy link
Collaborator Author

On a first quick scan, it is indeed going to be sueful and not too complex to extract the column selection/optimisation as you have envisaged.

I only have a couple of comments so far, but one bigger question: if you are familiar with the one-pass PR, is it obvious how to port what is here to that branch? It should not be hard, since necessary_columns still exists, and you're not actually calling any of the opt machinery here, but I think it's worth checking - in the long run, I still think that one-pass is what we should aim for.

I think so aswell: It should be straight forward to port this PR to the one-pass PR because it just uses the underlying optimization implementation and doesn't care how it's implemented. I haven't tried it yet though. Given that it's conceptually independent, I went forward to open the PR already now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose column optimization more explicitly?
3 participants