fix: pyarrow `unique` in `group_by` context #1076

FBruzzesi · 2024-09-26T20:55:42Z

What type of PR is this? (check all applicable)

Related issues

May come in handy for plotly

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

I was not able to add tests. I tried nesting a bunch of checks, but the order of elements inside the list type is not guaranteed.
Any ideas?
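One possible way around the ordering problem, sketched in plain Python rather than against the actual narwhals test suite: sort each group's list before comparing, so the assertion no longer depends on the order a backend happens to return. The helper name `normalise_lists` and the sample data are made up for illustration.

```python
def normalise_lists(result):
    """Sort each group's list so that comparisons ignore element order."""
    return {key: sorted(values) for key, values in result.items()}


# Hypothetical output of a group_by + unique, with arbitrary order inside each list.
actual = {"x": [3, 1, 2], "y": [5, 4]}
expected = {"x": [1, 2, 3], "y": [4, 5]}

assert normalise_lists(actual) == normalise_lists(expected)
```

Values that cannot be ordered against each other (for example `None` mixed with integers) would need a custom sort key or a set-based comparison instead.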

MarcoGorelli · 2024-09-27T09:13:18Z

thanks @FBruzzesi !

I think plotly would only need to get some value from the aggregation, rather than a list dtype?

perhaps we could allow

df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))

Something like this could help address the mode issue you'd spotted in skrub, iirc they just wanted to get a single value from the mode out, right?
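To make the proposed semantics concrete, here is a minimal sketch in plain Python. It is not narwhals code: `unique_value_per_group` is a hypothetical helper, and only the `fallback_value` name is taken from the proposal above. It returns the single distinct value per group, and either raises or substitutes the fallback when a group has more than one.

```python
from collections import defaultdict

_NO_FALLBACK = object()  # sentinel: distinguishes "no fallback given" from None


def unique_value_per_group(rows, key, column, fallback_value=_NO_FALLBACK):
    """Return {group: the single distinct value of `column` in that group}."""
    seen = defaultdict(list)
    for row in rows:
        values = seen[row[key]]
        if row[column] not in values:  # keep first-seen order, drop duplicates
            values.append(row[column])

    result = {}
    for group, values in seen.items():
        if len(values) == 1:
            result[group] = values[0]
        elif fallback_value is not _NO_FALLBACK:
            result[group] = fallback_value
        else:
            raise ValueError(
                f"group {group!r} has {len(values)} unique values: {values}"
            )
    return result


rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "x"},
    {"a": 2, "b": "y"},
    {"a": 2, "b": "z"},
]
print(unique_value_per_group(rows, "a", "b", fallback_value="(?)"))
# {1: 'x', 2: '(?)'}
```

Calling it without `fallback_value` on the same rows raises a `ValueError` for group `2`.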

FBruzzesi · 2024-09-27T09:27:14Z

Not sure if this is the right place for this discussion but here we go 🙃

> I think plotly would only need to get some value from the aggregation, rather than a list dtype?

Yes correct!

> perhaps we could allow
>
> df.group_by('a').agg(nw.unique_value('b'))  # raises if there's more than 1 unique value per group
> df.group_by('a').agg(nw.unique_value('b', fallback_value='(?)'))

Not the biggest fan of this if we are going to support nw.List type - a list is an aggregated value, and surprisingly pandas seems to behave quite well - at least for .unique 🙈

> Something like this could help address the mode issue you'd spotted in skrub, iirc they just wanted to get a single value from the mode out, right?

Correct again.

MarcoGorelli · 2024-09-27T09:35:49Z

> surprisingly pandas seems to behave quite well

yeah but it returns object dtype and I fear that'd create more issues for us down the line

FBruzzesi · 2024-09-27T09:52:34Z

> yeah but it returns object dtype and I fear that'd create more issues for us down the line

Yes that's not ideal, and yesterday I had issues converting to list type (e.g. .astype('pyarrow[list]') is not enough).

Maybe let's sleep on this, but I would imagine that someone using narwhals should just be a bit more pedantic and do:

(
    df
    .group_by("a")
    .agg(nw.col("b").unique())
    .with_columns(nw.col("b").cast(nw.List(...)))  # force it to be list type
    # ... now the .list namespace can be accessed
)

MarcoGorelli · 2024-09-27T10:23:20Z

i'm not sure that people would think to do that explicit cast, and implementing the list namespace would be quite difficult for pandas

we may be able to take inspiration from duckdb here, which has any_value as an aggregate function https://duckdb.org/docs/sql/functions/aggregates.html#any_valuearg

>>> rel = duckdb.read_parquet('../scratch/assets.parquet')
>>> duckdb.sql('select symbol, any_value(date) from rel group by symbol')
┌─────────┬─────────────────┐
│ symbol  │ any_value(date) │
│ varchar │      date       │
├─────────┼─────────────────┤
│ EWJ     │ 2022-01-31      │
│ OGN     │ 2022-01-31      │
│ PRU     │ 2022-01-31      │
│ AEP     │ 2022-01-31      │
│ ALLE    │ 2022-01-31      │
│ IEFM.L  │ 2022-01-31      │
│ EWG     │ 2022-01-31      │
│ SEGA.L  │ 2022-01-31      │
│ IAU     │ 2022-01-31      │
│ XLV     │ 2022-01-31      │
│  ·      │     ·           │
│  ·      │     ·           │
│  ·      │     ·           │
│ CNC     │ 2022-01-31      │
│ CTAS    │ 2022-01-31      │
│ DG      │ 2022-01-31      │
│ IEF     │ 2022-05-31      │
│ IEMG    │ 2022-01-31      │
│ JPEA.L  │ 2022-01-31      │
│ META    │ 2022-01-31      │
│ HIGH.L  │ 2022-03-17      │
│ HST     │ 2022-01-31      │
│ VXX     │ 2022-01-31      │
├─────────┴─────────────────┤
│    100 rows (20 shown)    │
└───────────────────────────┘

So, my thinking was that unique_value would be kind of like any_value, but would only work if there's exactly one unique value per group

If it's a top-level function (nw.unique_value) then I think it'd be ok to depart from Polars a bit there, we have other non-Polars functions in the top-level narwhals namespace

FBruzzesi · 2024-09-27T11:19:33Z

Just for clarity, when you say:

> Something like this could help address the mode issue you'd spotted in skrub, iirc they just wanted to get a single value from the mode out, right?

does it mean that nw.unique_value('b') can receive an expression (e.g. in the skrub case nw.unique_value(nw.col('b').mode()))?

MarcoGorelli · 2024-09-27T11:57:21Z

I haven't tried implementing it yet, but yes, I think so

alternatively, we could add:

nw.unique_value
nw.unique_mode

Alternatively, we could have our own Agg class and do something like

df.group_by('a').agg(nw.Agg.unique_mode('b'))
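For reference, the behaviour a `unique_mode` aggregation might have can be sketched in plain Python. This is an assumption about the intended semantics, not an existing narwhals function: return the most frequent value of a group, and raise when several values tie for it.

```python
from collections import Counter


def unique_mode(values):
    """Return the most frequent value, raising if several values tie for it."""
    counts = Counter(values)
    if not counts:
        raise ValueError("cannot take the mode of an empty group")
    top = max(counts.values())
    modes = [value for value, count in counts.items() if count == top]
    if len(modes) > 1:
        raise ValueError(f"mode is not unique: {modes}")
    return modes[0]


print(unique_mode(["x", "y", "x"]))  # x
```

A `fallback_value` parameter, as in the `unique_value` proposal, could replace the second `raise` if silently substituting a placeholder is preferred over failing.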

fix: pyarrow group_by unique

8c9246e

github-actions bot added the fix label Sep 26, 2024

MarcoGorelli mentioned this pull request Sep 27, 2024

feat: add nw.Struct , nw.List, and nw.Array dtypes #1067

Closed
