My understanding is that we want a small-ish number of row groups per file because that makes dataset loads faster. On the flip side, I imagine that putting too many rows in a group would make row lookups slower. So there's probably a sweet spot, but I haven't really tested this out. In my experience, most catalogs end up with about 1-4 row groups per file (given the writer kwarg defaults), but I've run into a very different case and am wondering what to do.
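For reference, this is roughly how I've been checking row-group counts, just a minimal sketch with plain pyarrow (the file path is a placeholder, not a real path in the catalog):

```python
import pyarrow.parquet as pq

# Placeholder path to one leaf parquet file in the catalog.
pf = pq.ParquetFile("path/to/part.parquet")

meta = pf.metadata
print(f"row groups: {meta.num_row_groups}")
print(f"total rows: {meta.num_rows}")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB uncompressed")
```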
I found that the ZTF Light Curves hipscat catalog I made has a mean of 22 row groups per file, with almost 2000 files having more than 40 row groups each. That must have happened because this dataset is skinny (<15 columns), so a 500MB file has tens of millions of rows (hipscat-import doesn't specify a max rows-per-group kwarg, so it must be using pyarrow's default, which maxes out at ~1 million rows per group). Since this is a 10TB dataset, the cumulative difference in efficiency between large and small row groups could be significant, and I'd rather not put out a product that's extra hard to work with. I may be re-making these files before we release them anyway, or at least making additional products as ZTF puts out new data releases, so I could adjust this then.
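If I do re-make the files, I'd expect the fix to look something like the sketch below, again assuming plain pyarrow (the paths are placeholders and the row_group_size value is just a guess to illustrate the effect, not a recommendation):

```python
import pyarrow.parquet as pq

# Placeholder paths; in practice this would run over each leaf file in the catalog.
table = pq.read_table("path/to/old_part.parquet")

# Rewrite with larger row groups. pyarrow caps each group at row_group_size rows,
# so e.g. 5_000_000 would turn a ~22-group file into roughly 4-5 groups.
pq.write_table(table, "path/to/new_part.parquet", row_group_size=5_000_000)
```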
Does anyone have a sense of whether reducing the number of row groups would make this dataset easier to use, and/or at what point it would become counterproductive because there'd be too many rows in each group?