Support Defining PartitionSpec and SortOrder without field-ids in create_table #338

sungwy · 2024-01-31T15:44:31Z

Feature Request / Improvement

Currently, the create_table API only supports defining partition fields and sort fields through PartitionSpec and SortOrder, respectively.

PartitionField and SortField have the constraint that their fields are defined using field-ids.

With #305 we now allow users to define a new table schema for the create_table operation without field-ids. This aligns with the usage pattern in Spark Iceberg DDLs, and allows users to create an Iceberg table from a PyArrow schema that may not have field-ids.

Similarly, we would like to support defining PartitionSpec and SortOrder without field_ids when we call create_table.

One early idea involves changing the create_table API, and perhaps the underlying functions (like assign_fresh_partition_spec_ids and assign_fresh_sort_order_ids), to allow users to define partition specs and sort orders on create_table without field IDs (since new field IDs are generated for these tables anyway).
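To make the first idea concrete, here is a toy sketch in plain Python, not PyIceberg code. It assumes fresh IDs are handed out sequentially to top-level columns, and every name in it (NamedSortField, ResolvedSortField, assign_fresh_ids, resolve_sort_fields) is hypothetical and exists only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedSortField:
    """Hypothetical: a sort field that refers to its column by name."""
    column: str
    direction: str = "ASC"


@dataclass(frozen=True)
class ResolvedSortField:
    """Hypothetical: the ID-based shape that the issue says is required today."""
    source_id: int
    direction: str


def assign_fresh_ids(column_names):
    """Assign sequential field IDs, starting at 1, to top-level columns (an assumption)."""
    return {name: field_id for field_id, name in enumerate(column_names, start=1)}


def resolve_sort_fields(column_names, sort_fields):
    """Translate name-based sort fields into ID-based ones using the fresh IDs."""
    ids = assign_fresh_ids(column_names)
    resolved = []
    for field in sort_fields:
        if field.column not in ids:
            raise ValueError(f"Could not find column in schema: {field.column}")
        resolved.append(ResolvedSortField(ids[field.column], field.direction))
    return resolved


resolved = resolve_sort_fields(["a", "b"], [NamedSortField("b", "DESC")])
print(resolved)
```

The point of the sketch is that the user never has to know which ID a column will receive: the name is resolved only after the fresh IDs exist.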

Another idea includes creating the schema first without committing the table creation using stage-create, and then using the generated schema and Partition Evolution to commit the new table.

sungwy · 2024-03-26T15:49:14Z

Related: #498

jiaoew1991 · 2024-05-10T07:21:16Z

Hi @syun64 How is this issue going? I have been tormented by this restriction for several days and still haven't figured it out. 😓

sungwy · 2024-05-11T20:49:40Z

> Hi @syun64 How is this issue going? I have been tormented by this restriction for several days and still haven't figured it out. 😓

This will be a new feature that will be released in the upcoming 0.7.0 release, and will be supported on all currently supported catalogs.

Reference PR: #498

Samreay · 2024-11-05T07:40:45Z

Hey @sungwy, just thought I'd chase this as well. The PR you linked is merged and 0.7.1 is now out, so does that mean there is a new way of specifying sort order we can use with pyarrow schemas? I've been trying to do it the way recommended by the doco and am still running into a ValueError:

import polars as pl
from pyiceberg.catalog.glue import GlueCatalog
from pyiceberg.table.sorting import SortField, SortOrder
from pyiceberg.transforms import IdentityTransform

# Using the arrow schema as per https://py.iceberg.apache.org/#write-a-pyarrow-dataframe
df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1]}).to_arrow()

glue_catalog = GlueCatalog(name="", properties={"write.parquet.compression-codec": "snappy"})
table = glue_catalog.create_table(
    identifier="dev-cleaned.tmp_iceberg",
    schema=df.schema,
    location="s3:///mybucket/tmp",
    sort_order=SortOrder(SortField(source_id=1, transform=IdentityTransform())),
)

Gives:

Exception has occurred: ValueError
Could not find in old schema: 1 ASC NULLS FIRST
  File "/home/sam/arenko/flows-datalake/tmp_arrow.py", line 10, in <module>
    table = glue_catalog.create_table(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Could not find in old schema: 1 ASC NULLS FIRST

sungwy · 2024-11-05T13:59:54Z

Hi @Samreay thank you for chasing up!

We are now able to express partition spec updates without referencing a field_id by using the create_table_transaction method on a catalog. This will create a new transaction that can be used with a builder pattern to add schema or partition spec updates as noted in the last paragraph of the API documentation on Create a Table.
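A minimal sketch of that pattern follows. The helper name create_partitioned_table and its parameters are hypothetical; the catalog argument is assumed to be a PyIceberg catalog exposing create_table_transaction and load_table, and the partition column is referenced by name through update_spec().add_identity.

```python
def create_partitioned_table(catalog, identifier, schema, partition_column):
    """Create a table partitioned by identity on a column referenced by name.

    `catalog` is assumed to be a PyIceberg catalog and `schema` a PyArrow
    or Iceberg schema; this helper itself is illustrative only.
    """
    # The transaction stages the new table; it is committed when the outer
    # `with` block exits without an exception.
    with catalog.create_table_transaction(identifier=identifier, schema=schema) as txn:
        with txn.update_spec() as update_spec:
            # The column is referenced by name, so no field ID is needed.
            update_spec.add_identity(partition_column)
    return catalog.load_table(identifier)
```

With the objects from the example earlier in this thread, the call would look like create_partitioned_table(glue_catalog, "dev-cleaned.tmp_iceberg", df.schema, "a").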

Unfortunately, we don't yet have similar support for updating the sort order without field_ids, and it is partially because PyIceberg as a library doesn't yet make use of a sort_order even if it is defined on a table. But I can see how it might still be useful to support that, so that the sort_order can be set through PyIceberg, but be used by other engines.

Would you be interested in making a contribution, to enable updating the sort_order through PyIceberg?

Fokko · 2024-11-05T14:03:32Z

> We are now able to express partition spec updates without referencing a field_id by using the create_table_transaction method on a catalog.

If you're interested, this is being tracked in #1284

> This will create a new transaction that can be used with a builder pattern to add schema or partition spec updates as noted in the last paragraph of the API documentation on Create a Table.

Looking at the docs, should we get rid of the first example? In general, I think it is best to show a single example to the user, instead of confusing them with multiple ways of doing the same thing. WDYT?

sungwy · 2024-11-05T14:23:04Z

Yes, @Fokko - this is exactly the type of user confusion that prompted me to create the issue for #1284 to separate the behavior based on the input type for the schema argument.

I think adding a flag like you suggested, and showing a single example are both good suggestions to reduce the amount of confusion coming in from the community - I'll allocate some time for both of those items later today.

sungwy mentioned this issue Jan 31, 2024

Improve the InMemory Catalog Implementation #289

Merged

sungwy added this to the PyIceberg 0.7.0 release milestone Feb 7, 2024

sungwy removed this from the PyIceberg 0.7.0 release milestone Mar 26, 2024
