Join on enums fails to match when sinking to parquet #20916

Open
2 tasks done
mdavis-xyz opened this issue Jan 25, 2025 · 2 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), new-streaming (Features for or dependent on the new streaming engine), python (Related to Python Polars), wontfix

Comments

@mdavis-xyz
Contributor

mdavis-xyz commented Jan 25, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

tmp_path = '/tmp/test2.parquet'

et = pl.Enum(['x', 'y', 'z'])


# Left join two LazyFrames on an Enum-typed key.
lf = (
    pl.LazyFrame({'DUID': ['x', 'y']})
    .cast({"DUID": et})
    .join(
        pl.LazyFrame([{'DUID': 'x', 'TLF': 0.9}]).cast({"DUID": et}),
        on="DUID",
        how="left",
    )
)

print(lf.collect())

lf.sink_parquet(tmp_path)

print(
    pl.scan_parquet(tmp_path)
    .collect()
)

Log output

join parallel: true
LEFT join dataframes finished
shape: (2, 2)
┌──────┬──────┐
│ DUID ┆ TLF  │
│ ---  ┆ ---  │
│ enum ┆ f64  │
╞══════╪══════╡
│ x    ┆ 0.9  │
│ y    ┆ null │
└──────┴──────┘
try_get_writeable: local: /tmp/test2.parquet (canonicalize: Ok("/tmp/test2.parquet"))
RUN STREAMING PIPELINE
[df -> hstack -> callback -> parquet_sink, df -> hstack -> generic_join_build]
parquet scan with parallel = Columns
shape: (2, 2)
┌──────┬──────┐
│ DUID ┆ TLF  │
│ ---  ┆ ---  │
│ enum ┆ f64  │
╞══════╪══════╡
│ x    ┆ null │
│ y    ┆ null │
└──────┴──────┘

Issue description

When joining on a key that's an enum, if I just call .collect(), the rows match as expected. If I instead .sink_parquet(), no rows are matched.

Expected behavior

Both printed dataframes should be identical. Both should have a non-null value in TLF for DUID=x.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-6.8.0-1018-azure-x86_64-with-glibc2.39
Python:              3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

mdavis-xyz added the bug, needs triage, and python labels Jan 25, 2025
@deanm0000
Collaborator

What happens if you .collect(streaming=True)? I'm assuming it fails the same way as the sink. Also try again with the new streaming engine by setting os.environ["POLARS_FORCE_NEW_STREAMING"]="1"

@ritchie46
Member

This won't be fixed as we will switch to the new streaming engine soon.

ritchie46 added the new-streaming and wontfix labels Jan 27, 2025