Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore dask expressions support using ddf.partitions and map_partitions instead of delayed #300

Open
smcguire-cmu opened this issue Apr 26, 2024 · 2 comments
Assignees

Comments

@smcguire-cmu
Copy link
Contributor

Instead of using dask delayed to align and map over the partitions of our catalogs, we could try to use the ddf.partitions accessor to align the partitions as necessary and map_partitions over them. There are still questions over how we deal with empty partitions and divisions, but may be an approach to look into.

@nevencaplar
Copy link
Member

@smcguire-cmu Can you remind me, is this still active?

@smcguire-cmu
Copy link
Contributor Author

So I looked into this a while back. I switched out our join and crossmatch implementations to use map_partitions instead of delayed and saw some improvement in lazy operation time, but not orders of magnitude improvements.

Advantages:

  • faster lazy operations
  • if dask allows specifying necessary columns in map partitions, we can get the dask expressions query optimization to work through our functions

Disadvantages:

  • I think the code is less readable and less well organized than with delayed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

2 participants