Incorrect results with pl.struct().value_counts().struct.unnest() #20927

Closed
2 tasks done
etrotta opened this issue Jan 26, 2025 · 0 comments · Fixed by #20929
Labels: bug, needs triage, python


etrotta commented Jan 26, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# Each value i in 1..5 appears i times, so the correct count for value i is i.
test_df = pl.DataFrame({"x": [i for i in range(1, 6) for _ in range(i)]})
# Run it multiple times - the results vary each time
test_df.select(pl.struct("x").value_counts().struct.unnest())

Log output

>>> test_df.select(pl.struct("x").value_counts().struct.unnest())
shape: (5, 2)
┌───────────┬───────┐
│ x         ┆ count │
│ ---       ┆ ---   │
│ struct[1] ┆ u32   │
╞═══════════╪═══════╡
│ {2}       ┆ 4     │
│ {5}       ┆ 5     │
│ {3}       ┆ 3     │
│ {1}       ┆ 1     │
│ {4}       ┆ 2     │
└───────────┴───────┘
>>> test_df.select(pl.struct("x").value_counts().struct.unnest())
shape: (5, 2)
┌───────────┬───────┐
│ x         ┆ count │
│ ---       ┆ ---   │
│ struct[1] ┆ u32   │
╞═══════════╪═══════╡
│ {1}       ┆ 5     │
│ {5}       ┆ 3     │
│ {3}       ┆ 2     │
│ {4}       ┆ 1     │
│ {2}       ┆ 4     │
└───────────┴───────┘

Issue description

The results vary each time you run the example code, as if each output column were being shuffled independently of the other, so values end up paired with the wrong counts.

It works correctly if you do it in separate steps, or if you use pl.col('x') instead of pl.struct('x').
(I reduced the example as much as I could while still reproducing the issue. In my real case the struct had multiple columns, like pl.struct('x', 'y').)
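
The symptom can be modelled in plain Python, without Polars. This is only a sketch of the observed behaviour, not of the actual cause inside Polars; the helper names `shuffle_rows` and `shuffle_columns_independently` are hypothetical and exist only for this illustration.

```python
import random

# (value, count) pairs for the reproduction data: value i occurs i times.
pairs = [(i, i) for i in range(1, 6)]


def shuffle_rows(rows, rng):
    """Reorder whole rows: every value keeps its own count."""
    rows = list(rows)
    rng.shuffle(rows)
    return rows


def shuffle_columns_independently(rows, rng):
    """Reorder each column on its own: values and counts can get mismatched."""
    values = [v for v, _ in rows]
    counts = [c for _, c in rows]
    rng.shuffle(values)
    rng.shuffle(counts)
    return list(zip(values, counts))


rng = random.Random(0)
joint = shuffle_rows(pairs, rng)
independent = shuffle_columns_independently(pairs, rng)

# Row-wise shuffling preserves the set of (value, count) pairs.
assert sorted(joint) == sorted(pairs)
# Column-wise shuffling only preserves each column's contents, not the pairing.
assert sorted(v for v, _ in independent) == [1, 2, 3, 4, 5]
assert sorted(c for _, c in independent) == [1, 2, 3, 4, 5]
```

A non-deterministic row order is harmless here; a broken pairing is the actual bug.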

Expected behavior

It should return the count associated with the correct values for each row.

It works correctly if you do it in two separate steps. The rows still come back in a different order on each run, but every value keeps its correct count instead of the columns being shuffled independently.

>>> test_df.select(pl.struct("x").value_counts()).select(pl.all().struct.unnest())
shape: (5, 2)
┌───────────┬───────┐
│ x         ┆ count │
│ ---       ┆ ---   │
│ struct[1] ┆ u32   │
╞═══════════╪═══════╡
│ {4}       ┆ 4     │
│ {1}       ┆ 1     │
│ {2}       ┆ 2     │
│ {5}       ┆ 5     │
│ {3}       ┆ 3     │
└───────────┴───────┘
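
Because the row order is not stable, a correctness check has to ignore order. The following is a minimal sketch that uses only the standard library; the helper `pairs_are_correct` is hypothetical, and the rows are transcribed by hand from the first failing output in the log and from the two-step output.

```python
from collections import Counter

# Same data as the reproducible example: value i occurs i times.
data = [i for i in range(1, 6) for _ in range(i)]
expected = Counter(data)  # reference counts computed independently of Polars


def pairs_are_correct(rows):
    """Return True if rows hold exactly the expected (value, count) pairs, in any order."""
    return sorted(rows) == sorted(expected.items())


# (value, count) rows from the first failing one-step output.
one_step_rows = [(2, 4), (5, 5), (3, 3), (1, 1), (4, 2)]
# (value, count) rows from the two-step output.
two_step_rows = [(4, 4), (1, 1), (2, 2), (5, 5), (3, 3)]

print(pairs_are_correct(one_step_rows))  # False
print(pairs_are_correct(two_step_rows))  # True
```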

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            2.67.0
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
etrotta added the bug, needs triage, and python labels on Jan 26, 2025
ritchie46 self-assigned this on Jan 27, 2025