When creating a new catalog, we create many small files: each input file is sharded to conform to the output catalog, and then all of the shards that belong to the same HEALPix pixel are merged. Writing that many small files slows down the process. Explore a better way to do it (dask.shuffle?) that avoids them.
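To make the cost concrete, here is a toy in-memory sketch of the shard-then-merge pattern described above. It is not the importer's actual code: `split_rows_by_pixel`, `reduce_pixels`, and the integer-division `pixel_of` lookup are hypothetical stand-ins, and plain lists stand in for files on disk.

```python
from collections import defaultdict


def split_rows_by_pixel(rows, pixel_of):
    """Map step: group one input file's rows by destination pixel."""
    shards = defaultdict(list)
    for row in rows:
        shards[pixel_of(row)].append(row)
    return dict(shards)


def reduce_pixels(all_shards):
    """Reduce step: merge every shard that targets the same pixel."""
    merged = defaultdict(list)
    for shards in all_shards:
        for pixel, rows in shards.items():
            merged[pixel].extend(rows)
    return dict(merged)


# Three toy "input files"; each integer stands in for one catalog row.
input_files = [[3, 17, 4, 22], [5, 18, 23], [1, 2, 19]]


def pixel_of(row):
    # Hypothetical stand-in for a real HEALPix pixel lookup.
    return row // 8


all_shards = [split_rows_by_pixel(rows, pixel_of) for rows in input_files]

# One intermediate shard per (input file, pixel) pair that has any rows.
n_intermediate = sum(len(shards) for shards in all_shards)
catalog = reduce_pixels(all_shards)

print(n_intermediate, "intermediate shards ->", len(catalog), "output pixels")
```

The number of intermediate shards grows with the number of input files (or reader chunks) times the number of pixels each one touches, which is where the many small files come from.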
One thing I want to try on my next import is to consolidate the shards (per pixel) generated from a single input file before returning from the split_pixels function here. This should help for large input files that get split into many chunks by the reader. For small input files, I'll try #308 and then this consolidation should help with those as well.
So the same number of intermediate files would be written initially, but they'd immediately be reduced so that the next steps can deal with a smaller number of files. This should help not only with the final "reducing" step, but also (a) make it easier to verify the intermediate dataset, which I'm planning for #118; and (b) if/when something goes wrong with the import, make it easier to figure out what's actually on disk and then resume either splitting or reducing instead of starting over.
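A rough sketch of what that per-input-file consolidation could look like, using only the standard library. Everything about the layout is an assumption for illustration: the `pixel_<n>/shard_<input_id>_<chunk>.csv` naming, the `merged_<input_id>.csv` output name, and CSV standing in for whatever format the intermediate shards really use. `consolidate_shards` is a hypothetical helper, not part of the existing code.

```python
import csv
from collections import defaultdict
from pathlib import Path


def consolidate_shards(shard_dir, input_id):
    """Merge all shards written for one input file into one file per pixel.

    Assumes the hypothetical layout <shard_dir>/pixel_<n>/shard_<input_id>_<chunk>.csv
    and that every shard starts with the same header row.
    """
    # Collect this input file's shards, grouped by pixel directory.
    by_pixel = defaultdict(list)
    pattern = f"pixel_*/shard_{input_id}_*.csv"
    for path in sorted(Path(shard_dir).glob(pattern)):
        by_pixel[path.parent].append(path)

    merged_paths = []
    for pixel_dir, paths in by_pixel.items():
        merged = pixel_dir / f"merged_{input_id}.csv"
        wrote_header = False
        with merged.open("w", newline="") as out:
            writer = csv.writer(out)
            for path in paths:
                with path.open(newline="") as src:
                    reader = csv.reader(src)
                    header = next(reader, None)
                    if header is None:
                        continue  # empty shard, nothing to copy
                    if not wrote_header:
                        writer.writerow(header)
                        wrote_header = True
                    writer.writerows(reader)
        # Remove the small shards only after the merged file is fully written.
        for path in paths:
            path.unlink()
        merged_paths.append(merged)
    return merged_paths
```

Called at the end of the splitting step for each input file, this would leave at most one file per (input file, pixel) pair on disk, which is also a simpler state to inspect when resuming a failed import.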