feat: add possibility to manually perform the column projection #565

pfackeldey · 2025-01-14T21:05:04Z

This PR adds a new function (dak.manual.optimize_columns) that performs the column projection by hand, given a set of columns (or keys) in the same form that dak.inspect.report_necessary_columns returns. (Closes #559)

Why is this useful?

This PR makes it possible to run dak.inspect.report_necessary_columns once for one dataset and then reuse its output to manually project the same columns for other datasets.
This PR also lets the user manipulate the set of columns; see the test tests/test_manual.py for an example.
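To illustrate the second point, here is a minimal, self-contained sketch of editing a report before handing it on. It only assumes the mapping shape dict[str, frozenset[str]] taken from the optimize_columns signature in this PR; the layer name, the column names, and the helpers keep_only and add_columns are hypothetical and not part of dask-awkward.

```python
# Hypothetical report: input-layer name -> columns that layer has to read.
# The layer name and the column names are made up for illustration.
report = {
    "from-parquet-abc123": frozenset({"u4", "u8", "f4"}),
}


def keep_only(report, wanted):
    """Return a copy of the report restricted to the wanted columns."""
    wanted = frozenset(wanted)
    return {layer: cols & wanted for layer, cols in report.items()}


def add_columns(report, extra):
    """Return a copy of the report with extra columns added to every layer."""
    extra = frozenset(extra)
    return {layer: cols | extra for layer, cols in report.items()}


# Drop "f4" from the projection, then request one additional column.
trimmed = keep_only(report, {"u4", "u8"})
extended = add_columns(trimmed, {"i4"})
```

A mapping edited this way has the same shape as the columns argument of dak.manual.optimize_columns, so it could in principle be passed there unchanged.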

Note

Currently, dak.manual.optimize_columns disables any further projection of the AwkwardInputLayers it has touched. Otherwise, running the manual optimization followed by the automatic one (which is enabled by default) could produce unexpected results.

Looking forward to your feedback!

PS: If this gets accepted, I'll open another PR in uproot that adds the manual column optimization there as well.

codecov-commenter · 2025-01-14T21:08:16Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.51%. Comparing base (8cb8994) to head (d04dc80).
Report is 206 commits behind head on main.

Files with missing lines	Patch %	Lines
src/dask_awkward/manual/column_optimization.py	90.00%	2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #565      +/-   ##
==========================================
- Coverage   93.06%   92.51%   -0.56%     
==========================================
  Files          23       24       +1     
  Lines        3290     3458     +168     
==========================================
+ Hits         3062     3199     +137     
- Misses        228      259      +31


pfackeldey · 2025-01-14T21:16:49Z

The Conda tests seem to be failing for reasons unrelated to this PR right now.

martindurant · 2025-01-14T21:52:48Z

> The Conda tests seem to be failing for reasons unrelated to this PR right now.

mambaforge is deprecated; the GHA YAML definition should be reverted to miniconda.

martindurant

On a first quick scan, it is indeed going to be useful and not too complex to extract the column selection/optimisation as you have envisaged.

I only have a couple of comments so far, but one bigger question: if you are familiar with the one-pass PR, is it obvious how to port what is here to that branch? It should not be hard, since necessary_columns still exists, and you're not actually calling any of the opt machinery here, but I think it's worth checking - in the long run, I still think that one-pass is what we should aim for.

martindurant · 2025-01-14T22:00:03Z

src/dask_awkward/layers/layers.py

+                io_func_implements_projection(self.io_func)
+                and not self.has_been_unpickled
+            )
+        assert isinstance(self._is_projectable, bool)


These kinds of type checks should really not be necessary. In fact, assertions should not normally appear in non-test code at all.

yes 👍 I'll remove them

martindurant · 2025-01-14T22:00:44Z

src/dask_awkward/layers/layers.py

+    def project_manually(self, columns: frozenset[str]) -> AwkwardInputLayer:
+        """Project the necessary _columns_ to the AwkwardInputLayer."""
+        assert self.is_projectable
+        io_func = cast(ImplementsProjection, self.io_func).project_manually(columns)


Personally, I am not a fan of all the cruft it takes to make mypy happy - I'd rather make do with simpler or missing types.

I was following the same implementations as we have here for project and necessary_columns.

I tried to remove it, but unfortunately it is not that easy to appease mypy otherwise (except for putting # type: ignore at multiple places - but what's the point of type checking then?).

The reason for the current implementation is that we support a protocol that all projectable IO layers need to adhere to (i.e. ImplementsProjection). By making sure that self.is_projectable is true, we know that our IO layer has the methods of the ImplementsProjection protocol available. Unfortunately, mypy does not seem to recognize this, because self.io_func may be a more general implementation that only needs to satisfy the ImplementsIOFunction protocol. ImplementsProjection is a specialization of that protocol, so we can safely "cast" to it here after making sure that self.is_projectable is True.

It's not something that I invented/added new here. I'm following the existing protocols for the IO functions.

(The same argument goes for the comment below)
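The narrowing argument above can be sketched in a few self-contained lines. This is a simplified illustration, not the dask-awkward code: IOFunction, Projectable, Layer and ColumnReader are hypothetical stand-ins for ImplementsIOFunction, ImplementsProjection, AwkwardInputLayer and a concrete reader, and the hasattr check is only an assumed stand-in for the real projectability test.

```python
from typing import Protocol, cast


class IOFunction(Protocol):
    """General protocol: anything callable on a partition index."""

    def __call__(self, part: int) -> list[str]: ...


class Projectable(IOFunction, Protocol):
    """Narrower protocol: an IO function that can also be projected."""

    def project_columns(self, columns: frozenset[str]) -> "Projectable": ...


class ColumnReader:
    """Toy IO function that satisfies Projectable structurally."""

    def __init__(self, columns: frozenset[str]) -> None:
        self.columns = columns

    def __call__(self, part: int) -> list[str]:
        return sorted(self.columns)

    def project_columns(self, columns: frozenset[str]) -> "ColumnReader":
        return ColumnReader(self.columns & columns)


class Layer:
    def __init__(self, io_func: IOFunction) -> None:
        # Declared with the general protocol, so a type checker cannot
        # see project_columns on this attribute.
        self.io_func = io_func

    @property
    def is_projectable(self) -> bool:
        return hasattr(self.io_func, "project_columns")

    def project(self, columns: frozenset[str]) -> "Layer":
        if not self.is_projectable:
            return self
        # The runtime check above justifies the cast; cast() itself
        # does nothing at runtime and only informs the type checker.
        io_func = cast(Projectable, self.io_func).project_columns(columns)
        return Layer(io_func)


projected = Layer(ColumnReader(frozenset({"a", "b", "c"}))).project(frozenset({"a", "c"}))
plain = Layer(lambda part: ["a"])
```

Without the cast, a type checker would reject the project_columns call because the attribute is typed with the general protocol only.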

> It's not something that I invented/added new here. I'm following the existing protocols for the IO functions.

Agreed. It is mostly removed in the one-pass PR, and I would say nothing useful got lost :)

martindurant · 2025-01-14T22:02:00Z

src/dask_awkward/lib/io/columnar.py

@@ -44,6 +44,8 @@ def attrs(self) -> dict | None: ...

    def project_columns(self: T, columns: frozenset[str]) -> T: ...

+    def project_manually(self: T, columns: frozenset[str]) -> ImplementsIOFunction: ...


This is fully specified in the implementation of ColumnProjectionMixin, right? So does the protocol definition here serve any purpose?

I may be able to remove it without mypy complaining; I'll check. Apart from that, I added it here because other methods are defined on this protocol as well, e.g. project_columns.

martindurant · 2025-01-14T22:02:53Z

src/dask_awkward/manual/column_optimization.py

+from dask_awkward.lib.core import Array
+
+
+def optimize_columns(array: Array, columns: dict[str, frozenset[str]]) -> Array:


We need to think of a way to ensure that the resulting output is not optimised again at compute time.

That's why I set .is_projectable on the layers that are optimized "by hand" to False. However, I'm not sure if that's the most elegant solution.

Hah, OK, I can see that works. We should see what it does with multiple input layers.

👍 It should work as follows when calling .compute(optimize_graph=True): any layer that has been manually projected won't be optimized again (it is basically treated like any other non-column-optimizable layer), whereas any other AwkwardInputLayer goes through the usual automatic optimization process.
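The intended behaviour with multiple input layers can be modelled with a small toy example. Everything here is hypothetical (ToyInputLayer, project_manually and automatic_optimize are not dask-awkward objects); it only mimics the idea that a manual projection flips is_projectable to False so that a later automatic pass skips that layer.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyInputLayer:
    name: str
    columns: frozenset
    is_projectable: bool = True


def project_manually(layer, columns):
    """Project by hand and mark the layer as no longer projectable."""
    return replace(
        layer,
        columns=layer.columns & frozenset(columns),
        is_projectable=False,
    )


def automatic_optimize(layers, necessary):
    """Toy automatic pass: project only layers that are still projectable."""
    out = []
    for layer in layers:
        if layer.is_projectable and layer.name in necessary:
            layer = replace(layer, columns=layer.columns & necessary[layer.name])
        out.append(layer)
    return out


a = ToyInputLayer("a", frozenset({"x", "y", "z"}))
b = ToyInputLayer("b", frozenset({"p", "q"}))

# Layer "a" is projected by hand; layer "b" is left to the automatic pass.
a_manual = project_manually(a, {"x", "y"})
necessary = {"a": frozenset({"x"}), "b": frozenset({"p"})}
result = automatic_optimize([a_manual, b], necessary)
```

In this model, layer "a" keeps the hand-picked columns even though the automatic pass would have selected fewer, while layer "b" is reduced as usual.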

martindurant · 2025-01-14T22:03:27Z

tests/test_manual.py

+
+    assert u4_and_u8_array.fields == ["u4", "u8"]
+
+    materialized_u4_and_u8_array = u4_and_u8_array.compute()


With or without optimise?

This would result in the same output, because currently dak.manual.optimize_columns sets the is_projectable property of AwkwardInputLayers to False, which disables any other column optimization afterwards.

pfackeldey · 2025-01-14T22:51:54Z

> On a first quick scan, it is indeed going to be useful and not too complex to extract the column selection/optimisation as you have envisaged.
>
> I only have a couple of comments so far, but one bigger question: if you are familiar with the one-pass PR, is it obvious how to port what is here to that branch? It should not be hard, since necessary_columns still exists, and you're not actually calling any of the opt machinery here, but I think it's worth checking - in the long run, I still think that one-pass is what we should aim for.

I think so as well: it should be straightforward to port this PR to the one-pass PR, because it just uses the underlying optimization implementation and doesn't care how that is implemented. I haven't tried it yet, though. Given that it's conceptually independent, I went ahead and opened the PR now.

feat: add possibility to manually perform the column projection

bf51d10

skip test if requests or aiohttp not available

f3e3a93

pfackeldey requested review from martindurant and agoose77 January 14, 2025 21:16

prettify doc string and type annotation

f99ce65

martindurant reviewed Jan 14, 2025

pfackeldey mentioned this pull request Jan 15, 2025

fix: _getitem_at_placeholder checks scikit-hep/awkward#3368

Open

remove assertions

d04dc80
