How many row groups per file should we aim for? #400

Open
troyraen opened this issue Oct 2, 2024 · 1 comment
Labels
question Further information is requested

Comments

@troyraen
Collaborator

troyraen commented Oct 2, 2024

My understanding is that we want a small-ish number of row groups per file because that makes the dataset load faster. On the flip side, I imagine that putting too many rows in a group would make row lookups slower. So there's probably a sweet spot, but I haven't tested this. In my experience, most catalogs end up with about 1-4 row groups per file (given the default writer kwargs), but I've run into a very different case and am wondering what to do.

I found that the ZTF Light Curves hipscat catalog I made has a mean of 22 row groups per file, with almost 2000 files having more than 40 row groups each. That must have happened because this dataset is skinny (<15 columns), so a 500MB file has tens of millions of rows. (hipscat-import doesn't specify a max-rows-per-group kwarg, so it must be using pyarrow's default, which maxes out at ~1 million rows per group.) Since this is a 10TB dataset, the cumulative difference in efficiency between large and small row groups could be significant, and I'd rather not put out a product that's extra hard to work with. I may be re-making these files before we release them anyway, or at least making additional products as ZTF puts out new data releases, so I could adjust this then.

Does anyone have a sense of whether reducing the number of row groups would make this dataset easier to use, and/or at what point it would be counterproductive because there'd be too many rows in each group?
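
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch of how many rows each group would need to hold to land in the 1-4 groups-per-file range. It assumes a hypothetical file made of exactly 22 full row groups at pyarrow's documented default cap of 1024 * 1024 rows. Real files will vary, and the helper names are made up for illustration.

```python
import math

# pyarrow's documented default cap on rows per row group.
DEFAULT_MAX_ROWS_PER_GROUP = 1024 * 1024


def row_groups_per_file(rows_in_file, max_rows_per_group):
    """Number of row groups a writer produces when it fills each group to the cap."""
    return math.ceil(rows_in_file / max_rows_per_group)


def rows_per_group_for_target(rows_in_file, target_groups):
    """Smallest per-group cap that yields at most `target_groups` row groups."""
    return math.ceil(rows_in_file / target_groups)


# Assumption: a file holding 22 full default-sized row groups (the catalog's mean count).
rows_in_file = 22 * DEFAULT_MAX_ROWS_PER_GROUP

for target in (1, 2, 4):
    cap = rows_per_group_for_target(rows_in_file, target)
    groups = row_groups_per_file(rows_in_file, cap)
    print(f"target {target} group(s): cap of {cap:,} rows -> {groups} group(s)")
```

Under that assumption, getting down to 4 groups per file means roughly 5.8 million rows per group. Whether groups that large are still comfortable to read is the open question.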

troyraen added the question label Oct 2, 2024
@nevencaplar
Member

This was discussed during the Friday office hours, October 04, 2024.

The conclusion is that we do not have a firm suggestion at this point. Further benchmarking is needed to answer this question.

2 participants