-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add possibility to manually perform the column projection #565
base: main
Are you sure you want to change the base?
feat: add possibility to manually perform the column projection #565
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #565 +/- ##
==========================================
- Coverage 93.06% 92.51% -0.56%
==========================================
Files 23 24 +1
Lines 3290 3458 +168
==========================================
+ Hits 3062 3199 +137
- Misses 228 259 +31 ☔ View full report in Codecov by Sentry. |
The Conda tests seems to fail unrelated to this PR right now. |
mambaforge is deprecated, it should be reverted back to miniconda in the GHA yaml definition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a first quick scan, it is indeed going to be sueful and not too complex to extract the column selection/optimisation as you have envisaged.
I only have a couple of comments so far, but one bigger question: if you are familiar with the one-pass PR, is it obvious how to port what is here to that branch? It should not be hard, since necessary_columns still exists, and you're not actually calling any of the opt machinery here, but I think it's worth checking - in the long run, I still think that one-pass is what we should aim for.
src/dask_awkward/layers/layers.py
Outdated
io_func_implements_projection(self.io_func) | ||
and not self.has_been_unpickled | ||
) | ||
assert isinstance(self._is_projectable, bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These ind of type-checks should really not be necessary. In fact, assertions should not normally appear in non-test code at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes 👍 I'll remove them
def project_manually(self, columns: frozenset[str]) -> AwkwardInputLayer: | ||
"""Project the necessary _columns_ to the AwkwardInputLayer.""" | ||
assert self.is_projectable | ||
io_func = cast(ImplementsProjection, self.io_func).project_manually(columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I am not a fan of all the cruft is takes to make mypy happy - I'd rather do with simpler or missing types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the same implementations as we have here for project
and necessary_columns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to remove it, but unfortunately it is not that easy to appease mypy otherwise (except for putting # type: ignore
at multiple places - but what's the point of type checking then?).
The reason for the current implementation is that we support a protocol that all projectable IO layers need to adhere to (i.e. ImplementsProjection
). By making sure that self.is_projectable
is true, we know that our IO layer has the methods of the ImplementsProjection
protocol available. Unfortunately, mypy seems to not recognize this correctly, because self.io_func
may be a more general implementation that only needs to implement the ImplementsIOFunction
protocol. ImplementsProjection
is a specification of that, and we can safely "cast" it here after making sure that self.is_projectable
is True.
It's not something that I invented/added new here. I'm following the existing protocols for the IO functions.
(The same argument goes for the comment below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not something that I invented/added new here. I'm following the existing protocols for the IO functions.
Agreed. It is mostly removed in the one-pass PR, and I would say nothing useful got lost :)
@@ -44,6 +44,8 @@ def attrs(self) -> dict | None: ... | |||
|
|||
def project_columns(self: T, columns: frozenset[str]) -> T: ... | |||
|
|||
def project_manually(self: T, columns: frozenset[str]) -> ImplementsIOFunction: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fully specified in the implementation of ColumnProjectionMixin, right? So does the protocol definition here serve any purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be able to remove it without mypy complaining, I'll check. Apart from that, I was adding it here as other ones were defined for this protocol aswell, e.g. project_columns
.
from dask_awkward.lib.core import Array | ||
|
||
|
||
def optimize_columns(array: Array, columns: dict[str, frozenset[str]]) -> Array: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to think of way to ensure that the resulting output is not optimised again at compute time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I set .is_projectable
on the layers that are optimized "by hand" to False
. However, I'm not sure if that's the most elegant solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, OK, I can see that works. We should see what it does with multiple input layers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 It should work as follows when calling .compute(optimize_graph=True)
: any layer that has been manually projected won't be optimized again (they are basically treated as a any other non-column-optimizable layer), whereas any other AwkwardInputLayer goes through the usual automatic optimization process.
|
||
assert u4_and_u8_array.fields == ["u4", "u8"] | ||
|
||
materialized_u4_and_u8_array = u4_and_u8_array.compute() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With or without optimise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would result in the same output, because currently dak.manual.optimize_columns
sets the is_projectable
property of AwkwardInputLayers to False
, which disables any other column optimization afterwards.
I think so aswell: It should be straight forward to port this PR to the one-pass PR because it just uses the underlying optimization implementation and doesn't care how it's implemented. I haven't tried it yet though. Given that it's conceptually independent, I went forward to open the PR already now. |
This PR adds a new function (
dak.manual.optimize_columns
) that allows to perform the column projection by hand given a set of columns (or keys) in the same form thatdak.inspect.report_necessary_columns
returns. (Closes #559)Why is this useful?
dak.inspect.report_necessary_columns
once for one dataset and then reuse its output to manually project these columns for other datasets.tests/test_manual.py
as an example.Note
Currently,
dak.manual.optimize_columns
will disable the possibility to project theAwkwardInputLayers again. This is because otherwise there might be unexpected results when running the manual optimization and then the automatic one that's enabled by default.Looking forward to your feedback!
PS: If this gets accepted, I'll open another PR in uproot that allows to perform the manual column optimization aswell.