feature request: dask_awkward.bundle, a conditional-touching awkward.zip function #536

Open
NJManganelli opened this issue Aug 25, 2024 · 2 comments

@NJManganelli

As discussed in a meeting a few weeks back between @ncsmith, @jpivarski, @lgray, Peter F., etc., Nick S proposed dask_awkward.bundle as a variation of awkward.zip that does not hard-touch all columns. This would let users build traceable forms, and could help simplify or eliminate some of the machinery currently used to build schemas in e.g. coffea. It would be especially useful in column-joining, where we want to build a typetracer with multiple inputs per row (with the schema applied over the union of input fields), run it through an analysis, dispatch only the minimally necessary columns for transformation and joining through external services, then deliver those columns for the actual computation.

To keep things uniform in awkward, it may also be nice to have an awkward.bundle which dispatches to dask_awkward.bundle, or calls awkward.zip for eager arrays (acting as a synonym).

See:
scikit-hep/coffea#1163

@agoose77
Collaborator

Somewhere we had a discussion about ak.zip and the way in which it touches. The practical reason that zip touches is that it broadcasts each column against the others, which may change the type of the array (new dimensions, regular to variable-length, etc.).

The statement that I increasingly find helpful to surface when framing these conversations is that typetracer doesn't guarantee that f(X) succeeds with type[Y]. Instead, it says that f(X) either succeeds with type[Y] or fails.

Given this, if we want a non-touching zip, what we actually need is a non-broadcasting zip. In turn, this requires that we defer the "are these columns compatible?" question to runtime and fail there. We need to be able to describe the structure of the result, which means that if any shapes are unknown, the result's shape must be unknown too (such that if any of the columns of the record are omitted, the shape is unchanged).

What's not so nice about this function is that it allows for code that would fail in an eager path to run in a delayed fashion; the error-causing branches may be pruned by optimisation.
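The trade-off can be sketched with a toy model in plain Python. This is not awkward or dask-awkward code: `LazyBundle`, its `compute` method, and the column names are hypothetical, invented only to show how deferring the compatibility check to runtime lets an incompatible column go unnoticed when nothing asks for it.

```python
# Toy model only: none of these names exist in awkward or dask-awkward.
class LazyBundle:
    """Record of named columns whose compatibility is checked only on use."""

    def __init__(self, length, **loaders):
        self.length = length    # declared up front by the caller
        self.loaders = loaders  # field name -> zero-argument callable
        self.touched = []       # fields that were actually materialised

    def compute(self, *fields):
        out = {}
        for name in fields:
            column = self.loaders[name]()
            self.touched.append(name)
            # The runtime check replaces broadcasting: lengths must already agree.
            if len(column) != self.length:
                raise ValueError(
                    f"{name!r} has length {len(column)}, expected {self.length}"
                )
            out[name] = column
        return out


bundle = LazyBundle(
    3,
    pt=lambda: [10.0, 20.0, 30.0],
    eta=lambda: [0.1, 0.2],  # incompatible: only two rows
)

# Succeeds, because "eta" is never loaded or checked.
result = bundle.compute("pt")
print(result, bundle.touched)
```

An eager zip over the same two columns would have failed immediately; here the failure only appears if some later step requests `eta`.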

If this is just about schema building, it might be better to do this for coffea-only (as coffea is the only client so far)?

@nsmith-

nsmith- commented Sep 3, 2024

I had hoped that the ak.bundle function signature would require the user to specify exactly what can be assumed about the broadcast-compatibility and ultimate layout expected of the result of bundling inputs together, and that this would be baked into the output dask-awkward array. At runtime, I suppose, the same level of checks that ak.from_buffers performs could happen, but nothing beyond that, since this is intended to replace what we now do manually with a dak.from_map over ak.from_buffers.
