feat: Support Bucket and Truncate transforms on write #1345

Merged
13 commits merged on Jan 16, 2025

Conversation

sungwy
Collaborator

@sungwy sungwy commented Nov 20, 2024

Getting the PR ready for when pyiceberg_core is released from iceberg-rust

PR to introduce python binding release: apache/iceberg-rust#705

Fixes: #1074

Consideration: for consistency, we could replace the existing pyarrow dependency in the order-preserving transforms (Year, Month, Day) with pyiceberg_core

@kevinjqliu kevinjqliu self-requested a review December 19, 2024 17:15
@sungwy sungwy marked this pull request as ready for review December 24, 2024 18:35
@sungwy sungwy changed the title from "Introduce bucket transform" to "feat: Support bucket and Truncate transforms on write" Dec 24, 2024
@sungwy sungwy changed the title from "feat: Support bucket and Truncate transforms on write" to "feat: Support Bucket and Truncate transforms on write" Dec 24, 2024
@sungwy sungwy requested a review from Fokko December 24, 2024 20:45
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Great to have writes for all the different transformations!

@pytest.mark.parametrize(
    "spec, expected_rows",
    [
        # none of non-identity is supported
Contributor

Suggested change (remove this line)
- # none of non-identity is supported

Comment on lines +1580 to +1583
source_type: PrimitiveType,
input_arr: Union[pa.Array, pa.ChunkedArray],
expected: Union[pa.Array, pa.ChunkedArray],
num_buckets: int,
Contributor

nit: wdyt about reordering these for readability? num_buckets, source_type and input_arr are configs of the BucketTransform; expected is the output

Collaborator Author

Hmm I think I feel indifferent here - there’s something nice about having the input and expected arrays side by side
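For readers following this thread, here is a minimal sketch of how the test inputs (a source value and num_buckets) relate to the expected bucket, based on the Iceberg spec's definition of the bucket transform: a 32-bit Murmur3 hash, masked to be non-negative, modulo the bucket count. The function names `murmur3_x86_32` and `bucket_long` are hypothetical and exist only for illustration; this is not the pyiceberg_core implementation, and it only covers int/long sources, which the spec hashes as 8-byte little-endian values.

```python
import struct


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit unsigned integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python 32-bit Murmur3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4
    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i : 4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks :]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket_long(value: int, num_buckets: int) -> int:
    """Bucket an int/long: hash its 8-byte little-endian form, mask the sign bit, take the modulo."""
    hashed = murmur3_x86_32(struct.pack("<q", value))
    return (hashed & 0x7FFFFFFF) % num_buckets
```

Seen this way, num_buckets and the source value are the inputs and the bucket index is the output, which is the grouping the review comment above is pointing at.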

@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Jan 8, 2025
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM, thanks for rebasing the PR! I left a few questions on the tests.

We're currently blocked by CI because Spark 3.5.3 has been removed from https://dlcdn.apache.org/spark/

(PartitionSpec(PartitionField(source_id=4, field_id=1001, transform=TruncateTransform(2), name="int_trunc"))),
(PartitionSpec(PartitionField(source_id=5, field_id=1001, transform=TruncateTransform(2), name="long_trunc"))),
(PartitionSpec(PartitionField(source_id=2, field_id=1001, transform=TruncateTransform(2), name="string_trunc"))),
(PartitionSpec(PartitionField(source_id=11, field_id=1001, transform=TruncateTransform(2), name="binary_trunc"))),
Contributor

should we include binary_trunc too?

Collaborator Author

Yeah good question. Truncating binary isn't supported with iceberg-rust so I've excluded this test case for now: https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/transform/truncate.rs#L132-L164
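As a point of reference for the int, long and string cases that remain in the test matrix, here is a minimal pure-Python sketch of the truncate semantics as the Iceberg spec defines them. The helper names `truncate_int` and `truncate_str` are hypothetical and are not part of PyIceberg or iceberg-rust. Binary is left out on purpose, matching the excluded test case.

```python
def truncate_int(value: int, width: int) -> int:
    """Round value down to the nearest multiple of width (also for negatives)."""
    # Python's % returns a non-negative remainder for a positive width,
    # so no extra adjustment is needed for negative values.
    return value - (value % width)


def truncate_str(value: str, length: int) -> str:
    """Keep at most the first `length` Unicode code points of value."""
    # Python str slicing counts code points, not bytes.
    return value[:length]
```

With a width of 10, for example, `truncate_int(-1, 10)` rounds down to -10 rather than up to 0, which is the behaviour the spec calls for.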

Comment on lines 763 to 769
# mixed with non-identity is not supported
(
    PartitionSpec(
        PartitionField(source_id=4, field_id=1001, transform=BucketTransform(2), name="int_bucket"),
        PartitionField(source_id=1, field_id=1002, transform=IdentityTransform(), name="bool"),
    )
),
Contributor

is this case supported now?

def _pyiceberg_transform_wrapper(
    self, transform_func: Callable[["ArrayLike", Any], "ArrayLike"], *args: Any
) -> Callable[["ArrayLike"], "ArrayLike"]:
    import pyarrow as pa
Contributor

Suggested change
- import pyarrow as pa
+ try:
+     import pyarrow as pa
+ except ModuleNotFoundError as e:
+     raise ModuleNotFoundError("For bucket/truncate transforms, PyArrow needs to be installed") from e
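The suggestion follows a common pattern for optional dependencies: import lazily, and re-raise with a message that names the feature that needs the package. A generic sketch of that pattern follows; the helper `require_module` is hypothetical and is not something this PR adds. It uses only the standard library so that it runs without PyArrow installed.

```python
import importlib
from types import ModuleType


def require_module(name: str, feature: str) -> ModuleType:
    """Import an optional dependency, or fail with a feature-specific message."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as e:
        # Chain with `from e` so the original import error stays visible in the traceback.
        raise ModuleNotFoundError(f"For {feature}, {name} needs to be installed") from e
```

A caller would write something like `pa = require_module("pyarrow", "bucket/truncate transforms")` inside the function that needs it, so users who never use those transforms do not pay for the import.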

Contributor

@Fokko Fokko left a comment

Sorry for the long wait, I think this one got buried in my mailbox. I left one minor nit, this looks great @sungwy 🙌

@sungwy
Collaborator Author

sungwy commented Jan 16, 2025

> Sorry for the long wait, I think this one got buried in my mailbox. I left one minor nit, this looks great @sungwy 🙌

no problem @Fokko - I'm just coming back from holidays myself.

And thank you for taking another round of reviews, @kevinjqliu! I'll take the nits and retrigger the CI now that the Spark artifact issue has been fixed.

@Fokko
Contributor

Fokko commented Jan 16, 2025

Thanks @sungwy and I hope you had some great time off :)

@Fokko Fokko merged commit 50c33aa into apache:main Jan 16, 2025
7 checks passed
@sungwy sungwy deleted the bucket-transforms branch January 16, 2025 15:55
@Fokko Fokko mentioned this pull request Jan 20, 2025
4 tasks
Development

Successfully merging this pull request may close these issues.

Support writes to Bucket Partitioned Tables
3 participants