feat: Support Bucket and Truncate transforms on write #1345
Conversation
Force-pushed 560ba20 to bd80f39
LGTM! Great to have writes for all the different transformations!
@pytest.mark.parametrize(
    "spec, expected_rows",
    [
        # none of non-identity is supported
# none of non-identity is supported
source_type: PrimitiveType,
input_arr: Union[pa.Array, pa.ChunkedArray],
expected: Union[pa.Array, pa.ChunkedArray],
num_buckets: int,
nit: wdyt of reordering these for readability? num_buckets, source_type and input_arr are configs of the BucketTransform; expected is the output
Hmm I think I feel indifferent here - there’s something nice about having the input and expected arrays side by side
LGTM, thanks for rebasing the PR! I left a few questions on the tests.
We're currently blocked by CI since Spark 3.5.3 has been removed from https://dlcdn.apache.org/spark/
(PartitionSpec(PartitionField(source_id=4, field_id=1001, transform=TruncateTransform(2), name="int_trunc"))),
(PartitionSpec(PartitionField(source_id=5, field_id=1001, transform=TruncateTransform(2), name="long_trunc"))),
(PartitionSpec(PartitionField(source_id=2, field_id=1001, transform=TruncateTransform(2), name="string_trunc"))),
(PartitionSpec(PartitionField(source_id=11, field_id=1001, transform=TruncateTransform(2), name="binary_trunc"))),
should we include binary_trunc too?
Yeah good question. Truncating binary isn't supported with iceberg-rust so I've excluded this test case for now: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/transform/truncate.rs#L132-L164
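For reference, the truncate semantics under discussion can be sketched in plain Python. This is a hedged illustration of the Iceberg spec's definition of the transform (lower bound of the width-sized bin for integers, code-point prefix for strings, byte prefix for binary), not the pyiceberg implementation itself; the function names are made up for the example.

```python
def truncate_int(value: int, width: int) -> int:
    # Integers truncate to the lower bound of their width-sized bin:
    # v - (v mod W), where mod is floor mod (Python's % already is),
    # so negative values stay aligned to the same bins.
    return value - (value % width)

def truncate_str(value: str, width: int) -> str:
    # Strings truncate to at most `width` code points.
    return value[:width]

def truncate_bytes(value: bytes, width: int) -> bytes:
    # Binary truncates to at most `width` bytes -- the case excluded
    # from the test above because iceberg-rust does not support it.
    return value[:width]

print(truncate_int(7, 2))          # 6
print(truncate_int(-1, 10))        # -10 (floor mod keeps bins aligned)
print(truncate_str("iceberg", 2))  # ic
```

Note that truncating a negative integer rounds toward negative infinity, which is why floor mod (rather than C-style truncated mod) matters here.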
# mixed with non-identity is not supported
(
    PartitionSpec(
        PartitionField(source_id=4, field_id=1001, transform=BucketTransform(2), name="int_bucket"),
        PartitionField(source_id=1, field_id=1002, transform=IdentityTransform(), name="bool"),
    )
),
is this case supported now?
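For context on what BucketTransform(2) in the case above computes: the Iceberg spec defines bucketing as a 32-bit Murmur3 hash of the serialized value, masked to non-negative and taken modulo the bucket count. The sketch below is a pure-Python illustration of that definition under those assumptions, not the pyiceberg implementation (which delegates hashing to its transform code); `bucket_int` is a made-up name for the example.

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python 32-bit MurmurHash3 (x86 variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Process the body in 4-byte little-endian chunks.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i : i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Mix in the 1-3 trailing bytes, if any.
    tail = data[n - n % 4 :]
    k = 0
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization (avalanche mixing).
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def bucket_int(value: int, num_buckets: int) -> int:
    # The spec hashes ints and longs identically, as 8-byte
    # little-endian values; the spec also publishes test vectors
    # for these hashes.
    raw = value.to_bytes(8, "little", signed=True)
    return (murmur3_x86_32(raw) & 0x7FFFFFFF) % num_buckets
```

Every bucket value lands in [0, N), so BucketTransform(2) in the case above can only ever produce partition values 0 and 1.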
pyiceberg/transforms.py (Outdated)
def _pyiceberg_transform_wrapper(
    self, transform_func: Callable[["ArrayLike", Any], "ArrayLike"], *args: Any
) -> Callable[["ArrayLike"], "ArrayLike"]:
    import pyarrow as pa
Suggested change:
try:
    import pyarrow as pa
except ModuleNotFoundError as e:
    raise ModuleNotFoundError("For bucket/truncate transforms, PyArrow needs to be installed") from e
Sorry for the long wait, I think this one got buried in my mailbox. I left one minor nit, this looks great @sungwy 🙌
no problem @Fokko - I'm just coming back from holidays myself. And thank you for taking another round of reviews @kevinjqliu! I'll take the nits and retrigger the CI now that the Spark artifact issue has been fixed
Thanks @sungwy and I hope you had some great time off :)
Getting the PR ready for when pyiceberg_core is released from iceberg-rust.
PR to introduce python binding release: apache/iceberg-rust#705
Fixes: #1074
Consideration: we could replace the existing pyarrow dependency of the order_preserving transforms (Month, Year, Date) with pyiceberg_core for consistency