Creation of many small files before merging #275

Open
nevencaplar opened this issue Apr 5, 2024 · 1 comment

Comments

@nevencaplar
Member

When creating a new catalog, we shard each input file to conform to the output catalog's partitioning and then merge all of the shards that belong to the same HEALPix pixel. This writes many small intermediate files, which slows down the import. We should explore a better way to do this (dask.shuffle?) that avoids writing so many small files.
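To make the file-count problem concrete, here is a minimal sketch that counts the intermediate shards a split-then-merge import produces. It assumes each chunk read from the input is just a list of pixel ids and that every (chunk, pixel) pair becomes one shard file; `count_shards` is a hypothetical helper for illustration, not part of the import pipeline.

```python
from collections import defaultdict


def count_shards(chunks):
    """Return {pixel: number of shard files} for a split-then-merge import.

    Assumption: `chunks` is a list of lists of pixel ids, one inner list per
    chunk produced by the reader, and each distinct pixel seen in a chunk
    is written to its own shard file.
    """
    shards_per_pixel = defaultdict(int)
    for chunk in chunks:
        # One shard per distinct pixel in this chunk, however many rows it has.
        for pixel in set(chunk):
            shards_per_pixel[pixel] += 1
    return dict(shards_per_pixel)


# Three chunks whose rows fall into four output pixels.
chunks = [[0, 0, 1, 2], [1, 2, 2, 3], [0, 3, 3, 3]]
shards = count_shards(chunks)
total_shards = sum(shards.values())  # intermediate files written
final_files = len(shards)            # files left after merging
```

The number of shards grows with the number of chunks times the number of pixels each chunk touches, while the final file count depends only on the number of pixels. That gap is what a shuffle-style approach would try to close.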

@troyraen
Collaborator

One thing I want to try on my next import is to consolidate the shards (per pixel) generated from a single input file before returning from the split_pixels function here. This should help for large input files that get split into many chunks by the reader. For small input files, I'll try #308 and then this consolidation should help with those as well.

So the same number of intermediate files would be written initially, but they'd immediately be reduced so that the next steps deal with fewer files. This should help with the final "reducing" step. It should also a) make it easier to verify the intermediate dataset, which I'm planning for #118; and b) make it easier, if/when something goes wrong with the import, to figure out what's actually on disk and then resume either splitting or reducing instead of starting over.
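A rough sketch of the consolidation idea, using only the standard library. Everything here is an assumption for illustration: rows are `(pixel, value)` tuples, shards are CSV files in per-pixel directories, and `write_shards` and `consolidate_shards` are hypothetical names, not the actual split_pixels code.

```python
import csv
import tempfile
from pathlib import Path


def write_shards(chunks, work_dir):
    """Write one CSV shard per (chunk, pixel) pair, mimicking the split step."""
    work_dir = Path(work_dir)
    for chunk_index, chunk in enumerate(chunks):
        by_pixel = {}
        for pixel, value in chunk:
            by_pixel.setdefault(pixel, []).append((pixel, value))
        for pixel, rows in by_pixel.items():
            pixel_dir = work_dir / f"pixel_{pixel}"
            pixel_dir.mkdir(parents=True, exist_ok=True)
            shard_path = pixel_dir / f"shard_{chunk_index}.csv"
            with shard_path.open("w", newline="") as f:
                csv.writer(f).writerows(rows)


def consolidate_shards(work_dir):
    """Merge each pixel's shards into one file, then delete the shards."""
    merged = []
    for pixel_dir in sorted(Path(work_dir).iterdir()):
        shards = sorted(pixel_dir.glob("shard_*.csv"))
        target = pixel_dir / "consolidated.csv"
        with target.open("w", newline="") as out:
            for shard in shards:
                out.write(shard.read_text())
        # Remove shards only after the consolidated file is fully written.
        for shard in shards:
            shard.unlink()
        merged.append(target)
    return merged


# One input file that the reader split into three chunks.
chunks = [
    [(0, "a"), (1, "b"), (0, "c")],
    [(1, "d"), (2, "e")],
    [(0, "f"), (2, "g")],
]
with tempfile.TemporaryDirectory() as tmp:
    write_shards(chunks, tmp)
    shard_count = len(list(Path(tmp).glob("pixel_*/shard_*.csv")))
    merged = consolidate_shards(tmp)
    file_count = len(list(Path(tmp).glob("pixel_*/*.csv")))
    pixel0_rows = merged[0].read_text().splitlines()
```

Deleting the shards only after the consolidated file has been written means that what is on disk is always either the full set of shards or the merged file, which is the property that would make resuming after a failure simpler.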

@nevencaplar removed the status in HATS / LSDB on Aug 9, 2024
@nevencaplar moved this to Todo in HATS / LSDB on Aug 22, 2024
@nevencaplar removed the status in HATS / LSDB on Aug 22, 2024