
Enhance catalog.create_table API to enable creation of table with matching field_ids to provided Schema #1284

Open
sungwy opened this issue Nov 1, 2024 · 8 comments

@sungwy
Collaborator

sungwy commented Nov 1, 2024

Feature Request / Improvement

Currently, create_table takes a pyiceberg.schema.Schema or a pyarrow.Schema. The API ignores the field_ids provided in the schema and assigns a set of new ones.

Not only does this cause confusion for users, because pyiceberg.schema.Schema requires field_ids, but it also means that a table cannot be created with the guarantee that its schema will match the field IDs of the provided schema. This prevents the API from being used for table migrations (discussion on example use case), where a user would want to take the following steps:

  1. catalog.load_table to load the pyiceberg.table.Table of an existing Iceberg table.
  2. get pyiceberg.schema.Schema of the loaded table
  3. create a new table in the target catalog using catalog.create_table using the existing table's schema
  4. copy over files into the new table using add_files (this is not possible yet, but there is a discussion that would allow add_files to work on a table whose schema has matching field_ids)

The above procedure will not work unless we enhance create_table to enable the creation of a new table that matches the field_ids of the provided Schema.
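
A minimal sketch of that workflow, assuming the enhancements discussed here land (the catalog names are placeholders; step 3 currently reassigns IDs, and step 4 depends on the linked add_files discussion):

from pyiceberg.catalog import load_catalog

source_catalog = load_catalog("source")
target_catalog = load_catalog("target")

# 1. Load the existing Iceberg table
source_table = source_catalog.load_table("db.events")

# 2. Get its pyiceberg.schema.Schema, which already carries field_ids
schema = source_table.schema()

# 3. Create the new table; the proposal is for this call to keep the field_ids
#    (today it reassigns them)
new_table = target_catalog.create_table("db.events", schema=schema)

# 4. Copy the existing data files over (depends on the add_files discussion above)
file_paths = [task.file.file_path for task in source_table.scan().plan_files()]
new_table.add_files(file_paths)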

One way of addressing this issue would be to support two ways of representing the Iceberg table's Schema:

  1. If a schema without field_id is passed into the API, we should create a new pyiceberg.schema.Schema with newly assigned field_ids to create the table.
  2. If a schema with field_id is passed into the API, we should use the exact same pyiceberg.schema.Schema to create the table.

I discuss a few ideas for achieving this below, each with its own pros and cons:

Create a subclass of pyiceberg.schema.Schema without field_id

This sounds like the best approach, but once explored, we quickly realize that it may be impossible. The main challenge is that pyiceberg.schema.Schema describes its fields using NestedField, which are nested structures of pydantic BaseModels with field_id as a required field. So there isn't a way to create a subclass of pyiceberg.schema.Schema without field_id.
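
To illustrate: because field_id is a required pydantic field, a field_id-less variant cannot simply omit it (a small demonstration):

from pydantic import ValidationError

from pyiceberg.types import NestedField, StringType

try:
    # field_id is a required attribute of the pydantic model backing NestedField
    NestedField(name="name", field_type=StringType(), required=False)
except ValidationError as err:
    print(err)  # validation fails because field_id is missing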

Create a variant class of pyiceberg.schema.Schema without field_id

This is a bit different from the above approach, and requires us to make variant classes of pyiceberg.schema.Schema that are not subclassed from it. This is not ideal, because we would have to maintain field_id-less copies of NestedField, StructType, MapType, ListType, and Schema, and create methods to build a field_id'ed Schema from its field_id-less variant. It is possible, but it would be hard and messy to maintain.

We could make field_id an optional attribute of NestedField and field_id'd iceberg Types

This would allow us to create a pyiceberg.schema.Schema with or without field_ids. However, it creates a new opportunity for issues to be introduced into PyIceberg that have so far been prevented by NestedField's attributes directly matching those of the REST Catalog spec. With field_id as an optional field, we would need to introduce many more validations across the code base to ensure that field_id is set on all the nested fields within a schema before using it.
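
To illustrate the cost: if field_id became Optional, code paths would need a recursive guard along these lines (hypothetical helper; today field_id can never be None):

from pyiceberg.schema import Schema
from pyiceberg.types import ListType, MapType, NestedField, StructType


def all_field_ids_assigned(schema: Schema) -> bool:
    # Hypothetical guard: only needed if field_id were made Optional
    def visit(field: NestedField) -> bool:
        if field.field_id is None:
            return False
        field_type = field.field_type
        if isinstance(field_type, StructType):
            return all(visit(child) for child in field_type.fields)
        if isinstance(field_type, ListType):
            return visit(field_type.element_field)
        if isinstance(field_type, MapType):
            return visit(field_type.key_field) and visit(field_type.value_field)
        return True

    return all(visit(field) for field in schema.fields)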

Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

I think this may be the safest approach, and it is user-friendly: pyiceberg.schema.Schema requires field_ids, and pyarrow.Schema does not. pyarrow.Schema is also a completely different class, so users do not hold the expectation that the field_ids within a pyarrow.Schema are kept in the pyiceberg.schema.Schema (although this is an enhancement we could introduce in the future). If and when we introduce alternate input schema representations to the API, we could evaluate case by case whether it makes sense to keep the field IDs or assign new ones.
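
A minimal sketch of this type-driven dispatch, treating it as illustrative only (_convert_schema_if_needed is the internal conversion helper quoted later in this thread, and assign_fresh_schema_ids is pyiceberg's existing ID assigner):

from typing import Union

import pyarrow as pa

from pyiceberg.catalog import Catalog
from pyiceberg.schema import Schema, assign_fresh_schema_ids


def prepare_schema(schema: Union[Schema, pa.Schema]) -> Schema:
    if isinstance(schema, Schema):
        # pyiceberg.schema.Schema requires field_ids, so respect them as-is
        return schema
    # A pyarrow.Schema carries no trusted field IDs: convert, then assign fresh ones
    iceberg_schema = Catalog._convert_schema_if_needed(schema)
    return assign_fresh_schema_ids(iceberg_schema)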

I am personally in favor of the last approach: reverting to keeping the field_ids of the provided pyiceberg.schema.Schema if the input is of that type, and, if the input schema is a pyarrow.Schema, creating a new Schema with freshly assigned IDs. The API's behavior will then feel more consistent with how our users use it, and with how they expect the field_ids of the created table's Schema to be handled in different scenarios.

I'd love to hear the thoughts of our community members on this topic before jumping into an implementation.

@kevinjqliu
Contributor

Thanks for the writeup!

Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

+1 to this

Here's my take:
Currently, create_table can take either a PyIceberg Schema (pyiceberg.schema.Schema) or a PyArrow Schema (pyarrow.Schema).
Inside create_table, a PyArrow Schema is converted to a PyIceberg Schema, and then the field_ids are all reassigned:

# Convert a pyarrow.Schema to a pyiceberg Schema if needed
iceberg_schema = self._convert_schema_if_needed(schema)
# Reassign every field_id, regardless of the IDs the caller provided
fresh_schema = assign_fresh_schema_ids(iceberg_schema)
# Remap the partition spec and sort order onto the freshly assigned IDs
fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, iceberg_schema, fresh_schema)
fresh_sort_order = assign_fresh_sort_order_ids(sort_order, iceberg_schema, fresh_schema)

There are 3 ways a user can interact with the create_table API:

  • Passing in a PyIceberg Schema that already has field_ids assigned
  • Passing in a PyIceberg Schema that needs field_ids assignment
  • Passing in a PyArrow Schema

For the first case, we should not reassign the field_ids, but we do today.
For the second case, we should assign field_ids, as they are required.
For the third case, we should convert to a PyIceberg Schema and assign field_ids.

I think it'll be more user-friendly for the create_table function to accept either

  1. PyIceberg Schema with field_ids already assigned
  2. PyArrow Schema, in which case, we'll do the conversion and field_ids assignment

Today, I think the only way to craft a PyIceberg Schema that needs field_id assignment is either by creating it by hand or by using the _convert_schema_if_needed function.
We might need a way to check the validity of the field_ids in case the PyIceberg Schema was created by hand.
And we can always couple _convert_schema_if_needed with assign_fresh_schema_ids to ensure that the PyIceberg Schemas always have a valid field_id assignment.
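
For the validity check, a hypothetical helper could verify that field_ids are unique (a sketch only; a production version would also walk nested struct/list/map fields, e.g. with pyiceberg's schema visitors):

from pyiceberg.schema import Schema


def validate_field_ids(schema: Schema) -> None:
    # Hypothetical guard for hand-crafted Schemas; checks top-level fields only
    field_ids = [field.field_id for field in schema.fields]
    duplicates = {fid for fid in field_ids if field_ids.count(fid) > 1}
    if duplicates:
        raise ValueError(f"Duplicate field_ids in schema: {sorted(duplicates)}")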

I'm excited about this change. This might also help us streamline create_table with partition_spec and sort_order since there's been an issue with the field_id mismatch.

@sungwy
Collaborator Author

sungwy commented Nov 1, 2024

Hey @kevinjqliu thanks for the support! Glad to hear we are on the same page regarding this issue.

As with all API changes, I think we would want to deliberate on these options carefully, so I'll keep this issue open for a bit longer for others to opine before moving on to implementation.

@sungwy sungwy self-assigned this Nov 1, 2024
@Fokko
Contributor

Fokko commented Nov 4, 2024

Thanks for bringing this up!

I'm excited about this change. This might also help us streamline create_table with partition_spec and sort_order since there's been an issue with the field_id mismatch.

Indeed :) To provide some historical context: the idea was that PyIceberg was more of a library than an end-user tool, and therefore it would always sit behind a second layer (like having PyArrow in front of it).

In general, I think the philosophy should be that people don't have to worry about field IDs; this should be hidden away. In the case of table migration, that's an advanced situation where you don't want to re-assign the IDs. I can think of two options:

  • Having a flag to not assign fresh IDs
  • Ability to have a custom assigning class, so you can also do other ways of assigning the IDs (pre-order, post-order, etc.); see the sketch below

Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).
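
For the second option, assign_fresh_schema_ids accepts an optional next_id callable in recent pyiceberg versions, so a custom assignment strategy could be plugged in before calling create_table; a rough sketch:

from itertools import count

from pyiceberg.schema import Schema, assign_fresh_schema_ids
from pyiceberg.types import LongType, NestedField, StringType

schema = Schema(
    NestedField(field_id=10, name="id", field_type=LongType(), required=True),
    NestedField(field_id=20, name="name", field_type=StringType(), required=False),
)

# Plug in a custom strategy: here, fresh IDs starting at 100 in visit order
counter = count(start=100)
reassigned = assign_fresh_schema_ids(schema, next_id=lambda: next(counter))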

@sungwy
Collaborator Author

sungwy commented Nov 8, 2024

Hi @Fokko , thank you for the suggestions! I just put up this PR to introduce a new flag assign_fresh_ids on create_table and related methods.
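
Hypothetical usage of the proposed flag, reusing the names from the workflow sketch earlier in this issue (the exact flag name and default are up for review in the PR):

table = target_catalog.create_table(
    "db.events",
    schema=source_table.schema(),
    assign_fresh_ids=False,  # keep the provided field_ids instead of reassigning
)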

Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).

Yes, I think this is a great point to note. I ran into this issue when setting up the integration tests against the REST Catalog image, and I was a bit confused. I understand that it is up to the REST Catalog server to decide how it wants to create the table once the request is accepted, but I can't help finding it very counter-intuitive that the REST Catalog takes an ID-assigned Schema, PartitionSpec, and SortOrder and would still assign fresh IDs ☹️

Regardless, I'm excited to introduce this feature, so that we can support more migration workflows through PyIceberg - which is something many users have been hoping to do through Python applications 🚀

@kevinjqliu
Contributor

In general, I think the philosophy should be that people don't have to worry about field IDs; this should be hidden away.

+1

In the case of table migration, that's an advanced situation where you don't want to re-assign the IDs

We can have a function to explicitly assign IDs, such as assign_fresh_schema_ids.
If an Iceberg Schema is given, it will use its originally assigned IDs. If a PyArrow Schema is given, it will be converted to an Iceberg Schema and have IDs assigned.

@sungwy WDYT of this method instead of adding the assign_fresh_ids flag to the functions? I think it doesn't make sense for the create_table function to have a flag for schema ID assignment.

I guess the difference is: is there a case where I'd want to call create_table with an Iceberg Schema and have assign_fresh_ids=False? Can the user just call assign_fresh_schema_ids(iceberg_schema) instead?
We don't care about PyArrow Schemas, since they are always converted and assigned IDs.
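
Under that proposal, ID reassignment becomes an explicit, user-driven step rather than a create_table flag (a sketch assuming create_table preserves the IDs of an Iceberg Schema; catalog and iceberg_schema are placeholders):

from pyiceberg.schema import assign_fresh_schema_ids

# Default: the Iceberg Schema's field_ids are preserved as-is
table = catalog.create_table("db.tbl", schema=iceberg_schema)

# Wanting fresh IDs becomes an explicit call before table creation
table = catalog.create_table("db.tbl", schema=assign_fresh_schema_ids(iceberg_schema))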

@kevinjqliu
Contributor

Looks like in new_table_metadata the ID assignment is propagated to the partition_spec and sort_order as well:

https://github.com/apache/iceberg-python/pull/1304/files#diff-a7fbd8c6b0564308ca872ba479f0e240ff2c893d15776a14e52b91b48e9ac470R524-R528

@sungwy
Collaborator Author

sungwy commented Nov 13, 2024

If Iceberg Schema is given, it will use its original assigned ID. If PyArrow Schemas is given, it will be converted to Iceberg Schemas and have IDs assigned.

@kevinjqliu - Yes, I actually feel more inclined to drive this based on the type of the input schema. To me, the fact that the Iceberg Schema is required to have IDs assigned makes a very strong case for us to respect the given field IDs and propagate them to the Iceberg table metadata.

The case where some Iceberg REST Catalog servers do not respect the given field IDs and assign fresh ones actually feels more like an edge case to me, one that we should opine on and course-correct. Ideally, when the IDs are specified on the Schema, they should be respected by the server. Otherwise, the Schema supplied in a createTableRequest should not have field IDs assigned in the request.

This approach aligns with my initial proposal on the issue description ⬆️

I put up the PR with the assign_fresh_ids flag based on @Fokko 's concerns with the above approach. Could I ask for your reviews on this PR, and maybe we can decide on the approach there? #1304

@Fokko
Contributor

Fokko commented Nov 27, 2024

Thanks for the context yesterday, I was still noodling on it overnight.

If I understand correctly (and please also share the video of you and @adrianqin; I must have missed it during paternity leave), you're looking for a shallow clone of a table. In this situation, the metadata and manifests are recreated, but the existing data files are re-used to avoid unnecessary heavy lifting. As I mentioned yesterday, this is still a bit tricky since you might do a delete operation where a data file is dropped that's still referenced by the other table, but that's inherent to a shallow clone.

Instead of mangling the create_table operation, wouldn't it be better to have a clone operation? This would still hide the field IDs from the end-user and give a much cleaner API to work with.
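
A hypothetical shape for such an operation (nothing like this exists in PyIceberg today; all names are illustrative, and a real version would also carry over the partition spec and sort order):

from pyiceberg.catalog import Catalog
from pyiceberg.table import Table


def clone_table(catalog: Catalog, source: Table, identifier: str) -> Table:
    # Recreate metadata (preserving the source schema's field IDs) and manifests,
    # while re-using the existing data files
    cloned = catalog.create_table(identifier, schema=source.schema())
    cloned.add_files([task.file.file_path for task in source.scan().plan_files()])
    return cloned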

The case where some Iceberg REST Catalog servers do not respect the given field IDs and assign fresh ones actually feels more like an edge case to me, one that we should opine on and course-correct.

The REST catalog follows the philosophy that clients shouldn't have to worry about field IDs. The register table endpoint is designed to take an existing metadata file and re-use all the metadata, rather than re-assigning it. But it looks like there is interest in more flavors than just create and register. A similar question was also raised on Slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1732441495465079
