How many row groups per file should we aim for? #400

troyraen · 2024-10-02T01:29:26Z

My understanding is that we want a small-ish number of row groups per file because that makes the dataset loads faster. On the flip side, I imagine that putting too many rows in a group would make row lookups slower. So there's probably a sweet spot, but I haven't really tested this out. In my experience, most catalogs end up with about 1-4 row groups per file (given defaults for the writer kwargs) but I've run into a very different case and am wondering what to do.

I found that the ZTF Light Curves hipscat catalog I made has a mean of 22 row groups per file, with almost 2000 files having more than 40 row groups each. That must have happened because this dataset is skinny (<15 columns), so a 500MB file has 10s of millions of rows (hipscat-import doesn’t specify a max-rows-per-group kwarg so must be using pyarrow’s default which maxes out at ~1 million). Since this is a 10TB dataset, the cumulative difference in efficiency between large/small row groups could be significant and I'd rather not put out a product that's extra hard to work with. I may be re-making these files before we release them anyway, or at least making additional products as ZTF puts out new data releases, so I could adjust this then.

Does anyone have a sense of whether reducing the number of row groups would make this dataset easier to use and/or at what point it would be counter productive because there'd be too many rows in each group?

nevencaplar · 2024-10-10T17:52:12Z

This was discussed during the Friday office hours, October 04, 2024.

The conclusion is that we do not have a firm suggestion at this point. Further benchmarking is needed to answer this question.

delucchi-cmu added this to HATS / LSDB Oct 2, 2024

troyraen added the question Further information is requested label Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How many row groups per file should we aim for? #400

How many row groups per file should we aim for? #400

troyraen commented Oct 2, 2024

nevencaplar commented Oct 10, 2024

How many row groups per file should we aim for? #400

How many row groups per file should we aim for? #400

Comments

troyraen commented Oct 2, 2024

nevencaplar commented Oct 10, 2024