Add support for maintain_order param in joins #17698

Open
wants to merge 21 commits into
base: branch-25.02

Conversation

Matt711
Contributor

@Matt711 Matt711 commented Jan 8, 2025

Description

Closes #17696

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added the "feature request" and "non-breaking" labels Jan 8, 2025

copy-pr-bot bot commented Jan 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@github-actions github-actions bot added the "Python" and "cudf.polars" labels Jan 8, 2025
@Matt711 Matt711 marked this pull request as ready for review January 9, 2025 15:39
@Matt711 Matt711 requested a review from a team as a code owner January 9, 2025 15:39
@Matt711 Matt711 requested a review from wence- January 9, 2025 15:39
Contributor

@wence- wence- left a comment

Thanks @Matt711, I think we are doing too much work in the "none" case.

right_order = plc.copying.gather(
    plc.Table([plc.filling.sequence(right_rows, init, step)]), rg, right_policy
)
if maintain_order in {"none", "left_right", "right_left"}:
Contributor

issue/question: If maintain_order == "none" places no ordering obligation on us, I think we should not be doing any work. What is happening here?

Contributor Author

Yes, you're correct. I'll also need to update the other tests in the suite, since polars defaults to "none". I'll add maintain_order="left" to those tests so they are reproducible.

Comment on lines +1225 to +1254
    left_order = plc.copying.gather(
        plc.Table([plc.filling.sequence(left_rows, init, step)]),
        lg,
        left_policy,
    )
    right_order = plc.copying.gather(
        plc.Table([plc.filling.sequence(right_rows, init, step)]),
        rg,
        right_policy,
    )
elif maintain_order == "left":
    left_order = plc.copying.gather(
        plc.Table([plc.filling.sequence(left_rows, init, step)]),
        lg,
        left_policy,
    )
elif maintain_order == "right":
    right_order = plc.copying.gather(
        plc.Table([plc.filling.sequence(right_rows, init, step)]),
        rg,
        right_policy,
    )
if maintain_order == "left":
    sort_keys = left_order.columns()
elif maintain_order == "right":
    sort_keys = right_order.columns()
elif maintain_order in {"none", "left_right"}:
    sort_keys = left_order.columns() + right_order.columns()
elif maintain_order == "right_left":
    sort_keys = right_order.columns() + left_order.columns()
Contributor

Can we avoid repetition here by just immediately making the sort_keys list?
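
One way to read this suggestion, sketched in plain Python rather than pylibcudf: look up which sides contribute sort keys and build the list in a single pass. The names `order_column`, `build_sort_keys`, and `_SIDES` are hypothetical, Python lists stand in for device columns and gather maps, and out-of-bounds map entries are modeled as `None`. It is an illustration of the shape of the refactor, not the code in this PR.

```python
from typing import Literal, Optional

MaintainOrder = Literal["left", "right", "left_right", "right_left"]

# Hypothetical lookup table: which sides supply sort keys, in priority order.
_SIDES: dict[str, tuple[str, ...]] = {
    "left": ("left",),
    "right": ("right",),
    "left_right": ("left", "right"),
    "right_left": ("right", "left"),
}


def order_column(num_rows: int, gather_map: list[int]) -> list[Optional[int]]:
    """Model of gathering the sequence 0..num_rows-1 through a gather map.

    Out-of-bounds entries become None (a stand-in for a nullifying policy).
    """
    return [i if 0 <= i < num_rows else None for i in gather_map]


def build_sort_keys(
    left_rows: int,
    lg: list[int],
    right_rows: int,
    rg: list[int],
    maintain_order: MaintainOrder,
) -> list[list[Optional[int]]]:
    """Build only the order columns that the requested ordering needs."""
    sources = {"left": (left_rows, lg), "right": (right_rows, rg)}
    return [order_column(*sources[side]) for side in _SIDES[maintain_order]]
```

With this shape, each side's order column is computed at most once and only when it is used, so the two if/elif chains collapse into one table lookup.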

Comment on lines +1322 to +1331
# Reorder maps based on maintain_order
lg, rg = cls._reorder_maps(
    left.num_rows,
    lg,
    left_policy,
    right.num_rows,
    rg,
    right_policy,
    maintain_order,
)
Contributor

Suggested change
- # Reorder maps based on maintain_order
- lg, rg = cls._reorder_maps(
-     left.num_rows,
-     lg,
-     left_policy,
-     right.num_rows,
-     rg,
-     right_policy,
-     maintain_order,
- )
+ if maintain_order != "none":
+     lg, rg = cls._reorder_maps(
+         left.num_rows,
+         lg,
+         left_policy,
+         right.num_rows,
+         rg,
+         right_policy,
+         maintain_order,
+     )

@@ -1195,6 +1192,7 @@ def _reorder_maps(
right_rows: int,
rg: plc.Column,
right_policy: plc.copying.OutOfBoundsPolicy,
maintain_order: Literal["none", "left", "right", "left_right", "right_left"],
Contributor

Suggested change
- maintain_order: Literal["none", "left", "right", "left_right", "right_left"],
+ maintain_order: Literal["left", "right", "left_right", "right_left"],

Or accept "none" but just return the input maps immediately.
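
A minimal sketch of the second option, again in plain Python with lists standing in for gather maps. The function name `reorder_maps` and its simplified signature are hypothetical, and the sketch assumes every index is in bounds (as in an inner join), so it says nothing about null handling.

```python
from typing import Literal

MaintainOrder = Literal["none", "left", "right", "left_right", "right_left"]


def reorder_maps(
    lg: list[int], rg: list[int], maintain_order: MaintainOrder
) -> tuple[list[int], list[int]]:
    """Reorder paired gather maps; do no work when no order is required."""
    if maintain_order == "none":
        # No ordering obligation: hand back the inputs untouched.
        return lg, rg

    if maintain_order == "left":
        key = lambda pair: pair[0]
    elif maintain_order == "right":
        key = lambda pair: pair[1]
    elif maintain_order == "left_right":
        key = lambda pair: (pair[0], pair[1])
    else:  # "right_left"
        key = lambda pair: (pair[1], pair[0])

    # sorted() is stable, so ties keep their incoming relative order.
    pairs = sorted(zip(lg, rg), key=key)
    return [left for left, _ in pairs], [right for _, right in pairs]
```

The early return keeps the call site unconditional, at the cost of letting "none" stay in the parameter's type.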

)
if maintain_order == "left":
sort_keys = left_order.columns()
elif maintain_order == "right":
Contributor Author

For the reviewer: This PR needs more work, but I'm opening it up for review so I can get some help handling a special case: full right joins. Specifically, the case where the test fails is when there are unmatched keys in the left dataframe. Any advice on how to handle this?

Example:

import polars as pl

left = pl.LazyFrame(
    {
        "a": [1, 2, 3, 1, None],
        "b": [1, 2, 3, 4, 5],
        "c": [2, 3, 4, 5, 6],
    }
)
right = pl.LazyFrame(
    {
        "a": [1, 4, 3, 7, None, None, 1],
        "c": [2, 3, 4, 5, 6, 7, 8],
        "d": [6, None, 7, 8, -1, 2, 4],
    }
)
q = left.join(right, on=pl.col("a"), how="full", maintain_order="right")
q.collect(engine="gpu")

The dataframes differ at column "a":

AssertionError: DataFrames are different (value mismatch for column 'a')
[left]:  [1, 1, None, 3, None, None, None, 1, 1, None, 2]
[right]: [1, 1, None, 3, None, None, None, 1, 1, 2, None]

The a=2 entry is unmatched in the right dataframe, so it should be appended to the end of the result, not included with the other matches.
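
To make the ordering question concrete, here is a plain-Python model of the example above. It is not the cudf.polars implementation: `full_join_pairs`, `reorder_for_right`, and `left_column` are hypothetical helpers, `None` stands in for a nullified gather-map entry, and the tie-break by left index is an assumption. The idea it illustrates is to sort rows that have a right index first, in right order, and to place left-only rows last, ordered by their left index.

```python
from typing import Optional

Pair = tuple[Optional[int], Optional[int]]

left_a = [1, 2, 3, 1, None]
right_a = [1, 4, 3, 7, None, None, 1]


def full_join_pairs(left: list, right: list) -> list[Pair]:
    """Naive full join on equality; null keys never match."""
    pairs: list[Pair] = []
    matched_right: set[int] = set()
    for li, lv in enumerate(left):
        hits = [ri for ri, rv in enumerate(right) if lv is not None and lv == rv]
        if hits:
            pairs.extend((li, ri) for ri in hits)
            matched_right.update(hits)
        else:
            pairs.append((li, None))  # left row with no partner
    pairs.extend((None, ri) for ri in range(len(right)) if ri not in matched_right)
    return pairs


def reorder_for_right(pairs: list[Pair]) -> list[Pair]:
    """Right order first; rows without a right index go last, by left index."""

    def key(pair: Pair) -> tuple[bool, int, int]:
        li, ri = pair
        return (ri is None, ri if ri is not None else 0, li if li is not None else 0)

    return sorted(pairs, key=key)


def left_column(pairs: list[Pair], left: list) -> list:
    """Materialize the left key column, with None where there is no left row."""
    return [left[li] if li is not None else None for li, _ in pairs]


result = left_column(reorder_for_right(full_join_pairs(left_a, right_a)), left_a)
```

For this input the model yields `[1, 1, None, 3, None, None, None, 1, 1, 2, None]`, which is the `[right]` line in the assertion message. In a columnar setting the same effect might come from sorting on the right order column with nulls placed last and the left order column as a secondary key, though whether that matches polars in every case would need checking.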

Labels
cudf.polars, feature request, non-breaking, Python
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for maintain_order param in joins
2 participants