-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for maintain_order param in joins #17698
base: branch-25.02
Are you sure you want to change the base?
Add support for maintain_order param in joins #17698
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Matt711, I think we are doing too much work in the "none" case.
right_order = plc.copying.gather( | ||
plc.Table([plc.filling.sequence(right_rows, init, step)]), rg, right_policy | ||
) | ||
if maintain_order in {"none", "left_right", "right_left"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue/question: If we have no obligation maintain_order == "none"
I think we should not be doing any work, what is happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're correct, I'll also need to update the other tests in the suite since polars defaults to "none" . So I'll add maintain_order="left"
to ensure those tests are reproducible.
left_order = plc.copying.gather( | ||
plc.Table([plc.filling.sequence(left_rows, init, step)]), | ||
lg, | ||
left_policy, | ||
) | ||
right_order = plc.copying.gather( | ||
plc.Table([plc.filling.sequence(right_rows, init, step)]), | ||
rg, | ||
right_policy, | ||
) | ||
elif maintain_order == "left": | ||
left_order = plc.copying.gather( | ||
plc.Table([plc.filling.sequence(left_rows, init, step)]), | ||
lg, | ||
left_policy, | ||
) | ||
elif maintain_order == "right": | ||
right_order = plc.copying.gather( | ||
plc.Table([plc.filling.sequence(right_rows, init, step)]), | ||
rg, | ||
right_policy, | ||
) | ||
if maintain_order == "left": | ||
sort_keys = left_order.columns() | ||
elif maintain_order == "right": | ||
sort_keys = right_order.columns() | ||
elif maintain_order in {"none", "left_right"}: | ||
sort_keys = left_order.columns() + right_order.columns() | ||
elif maintain_order == "right_left": | ||
sort_keys = right_order.columns() + left_order.columns() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid repetition here by just immediately making the sort_keys
list?
# Reorder maps based on maintain_order | ||
lg, rg = cls._reorder_maps( | ||
left.num_rows, | ||
lg, | ||
left_policy, | ||
right.num_rows, | ||
rg, | ||
right_policy, | ||
maintain_order, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Reorder maps based on maintain_order | |
lg, rg = cls._reorder_maps( | |
left.num_rows, | |
lg, | |
left_policy, | |
right.num_rows, | |
rg, | |
right_policy, | |
maintain_order, | |
) | |
if maintain_order != "none": | |
lg, rg = cls._reorder_maps( | |
left.num_rows, | |
lg, | |
left_policy, | |
right.num_rows, | |
rg, | |
right_policy, | |
maintain_order, | |
) |
@@ -1195,6 +1192,7 @@ def _reorder_maps( | |||
right_rows: int, | |||
rg: plc.Column, | |||
right_policy: plc.copying.OutOfBoundsPolicy, | |||
maintain_order: Literal["none", "left", "right", "left_right", "right_left"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maintain_order: Literal["none", "left", "right", "left_right", "right_left"], | |
maintain_order: Literal["left", "right", "left_right", "right_left"], |
Or accept "none"
but just return the input maps immediately.
) | ||
if maintain_order == "left": | ||
sort_keys = left_order.columns() | ||
elif maintain_order == "right": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the reviewer: This PR needs more work, but I'm opening it up for review so I can get some help handling a special case: full right joins. Specifically, the case where the test fails is when there are unmatched keys in the left dataframe. Any advice on how to handle this?
Example:
left = pl.LazyFrame(
{
"a": [1, 2, 3, 1, None],
"b": [1, 2, 3, 4, 5],
"c": [2, 3, 4, 5, 6],
}
)
right = pl.LazyFrame(
{
"a": [1, 4, 3, 7, None, None, 1],
"c": [2, 3, 4, 5, 6, 7, 8],
"d": [6, None, 7, 8, -1, 2, 4],
}
)
q = left.join(right, on=pl.col("a"), how="full", maintain_order="right")
q.collect(engine="gpu")
The dataframe differ at column "a"
AssertionError: DataFrames are different (value mismatch for column 'a')
[left]: [1, 1, None, 3, None, None, None, 1, 1, **None, 2]**
[right]: [1, 1, None, 3, None, None, None, 1, 1, **2, None]**
The a=2
entry is unmatched in the right dataframe, so it should be appended to the end of the result, not included with the other matches.
Description
Closes #17696
Checklist