
Enhance catalog.create_table API to enable creation of table with matching field_ids to provided Schema #1284

Open
sungwy opened this issue Nov 1, 2024 · 8 comments

@sungwy
Collaborator

sungwy commented Nov 1, 2024

Feature Request / Improvement

Currently, create_table takes a pyiceberg.schema.Schema or a pyarrow.Schema. The API ignores the field_ids provided in the schema and assigns a set of new ones.

Not only does this cause confusion for users, because pyiceberg.schema.Schema requires field_ids, but it also means that a table cannot be created with the guarantee that its schema will match the field IDs of the provided schema. This prevents the API from being used for table migrations (discussion on example use case), where a user would want to take the following steps:

  1. catalog.load_table to load the pyiceberg.table.Table of an existing Iceberg table.
  2. get pyiceberg.schema.Schema of the loaded table
  3. create a new table in the target catalog using catalog.create_table using the existing table's schema
  4. copy over files into the new table using add_files (this is not possible yet, but there is a discussion that would allow add_files to work on a table whose schema has matching field_ids)

The above procedure will not work unless we enhance create_table to enable the creation of a new table that matches the field_ids of the provided Schema.
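
A minimal sketch of that workflow, assuming the enhancements discussed here land (the catalog names are placeholders; step 3 currently reassigns IDs, and step 4 depends on the linked add_files discussion):

from pyiceberg.catalog import load_catalog

source_catalog = load_catalog("source")
target_catalog = load_catalog("target")

# 1. Load the existing Iceberg table
source_table = source_catalog.load_table("db.events")

# 2. Get its pyiceberg.schema.Schema, which already carries field_ids
schema = source_table.schema()

# 3. Create the new table; the proposal is for this call to keep the field_ids
#    (today it reassigns them)
new_table = target_catalog.create_table("db.events", schema=schema)

# 4. Copy the existing data files over (depends on the add_files discussion above)
file_paths = [task.file.file_path for task in source_table.scan().plan_files()]
new_table.add_files(file_paths)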

One way of addressing this issue would be to support two ways of representing the Iceberg table's Schema:

  1. If a schema without field_id is passed into the API, we should create a new pyiceberg.schema.Schema with newly assigned field_ids to create the table.
  2. If a schema with field_id is passed into the API, we should use the exact same pyiceberg.schema.Schema to create the table.

I discuss a few ideas for achieving this below, each with its own pros and cons:

Create a subclass of pyiceberg.schema.Schema without field_id

This sounds like the best approach, but once explored, we quickly realize that it may be impossible. The main challenge is that pyiceberg.schema.Schema describes its fields using NestedField, which are nested structures of pydantic BaseModels with field_id as a required field. So there isn't a way to create a subclass of pyiceberg.schema.Schema without field_id.
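
To illustrate: because field_id is a required pydantic field, a field_id-less variant cannot simply omit it (a small demonstration):

from pydantic import ValidationError

from pyiceberg.types import NestedField, StringType

try:
    # field_id is a required attribute of the pydantic model backing NestedField
    NestedField(name="name", field_type=StringType(), required=False)
except ValidationError as err:
    print(err)  # validation fails because field_id is missing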

Create a variant class of pyiceberg.schema.Schema without field_id

This is a bit different from the above approach, and requires us to make variant classes of pyiceberg.schema.Schema that are not subclassed from it. This is not ideal, because we would have to maintain field_id-less copies of NestedField, StructType, MapType, ListType, and Schema, and create methods to build a field_id'ed Schema from its field_id-less variant. It is possible, but it would be hard and messy to maintain.

We could make field_id an optional attribute of NestedField and field_id'd iceberg Types

This would allow us to create a pyiceberg.schema.Schema with or without field_ids. However, it creates a new opportunity for issues to be introduced into PyIceberg that have so far been prevented by NestedField's attributes directly matching those of the REST Catalog spec. With field_id as an optional field, we would need to introduce many more validations across the code base to ensure that field_id is set on all the nested fields within a schema before using it.
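
To illustrate the cost: if field_id became Optional, code paths would need a recursive guard along these lines (hypothetical helper; today field_id can never be None):

from pyiceberg.schema import Schema
from pyiceberg.types import ListType, MapType, NestedField, StructType


def all_field_ids_assigned(schema: Schema) -> bool:
    # Hypothetical guard: only needed if field_id were made Optional
    def visit(field: NestedField) -> bool:
        if field.field_id is None:
            return False
        field_type = field.field_type
        if isinstance(field_type, StructType):
            return all(visit(child) for child in field_type.fields)
        if isinstance(field_type, ListType):
            return visit(field_type.element_field)
        if isinstance(field_type, MapType):
            return visit(field_type.key_field) and visit(field_type.value_field)
        return True

    return all(visit(field) for field in schema.fields)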

Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

I think this may be the safest approach, and it is user-friendly: pyiceberg.schema.Schema requires field_ids, and pyarrow.Schema does not. pyarrow.Schema is also a completely different class, so users do not hold the expectation that the field_ids within a pyarrow.Schema are kept in the pyiceberg.schema.Schema (although this is an enhancement we could introduce in the future). If and when we introduce alternate input schema representations to the API, we could evaluate case by case whether it makes sense to keep the field IDs or assign new ones.
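
A minimal sketch of this type-driven dispatch, treating it as illustrative only (_convert_schema_if_needed is the internal conversion helper quoted later in this thread, and assign_fresh_schema_ids is pyiceberg's existing ID assigner):

from typing import Union

import pyarrow as pa

from pyiceberg.catalog import Catalog
from pyiceberg.schema import Schema, assign_fresh_schema_ids


def prepare_schema(schema: Union[Schema, pa.Schema]) -> Schema:
    if isinstance(schema, Schema):
        # pyiceberg.schema.Schema requires field_ids, so respect them as-is
        return schema
    # A pyarrow.Schema carries no trusted field IDs: convert, then assign fresh ones
    iceberg_schema = Catalog._convert_schema_if_needed(schema)
    return assign_fresh_schema_ids(iceberg_schema)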

I am personally in favor of the last approach: reverting to keeping the field_ids of the provided pyiceberg.schema.Schema if the input is of that type, and, if the input schema is a pyarrow.Schema, creating a new Schema with freshly assigned IDs. The API's behavior will then feel more consistent with how our users use it, and with how they expect the field_ids of the created table's Schema to be handled in different scenarios.

I'd love to hear the thoughts of our community members on this topic before jumping into an implementation.

@kevinjqliu
Contributor

Thanks for the writeup!

Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

+1 to this

Here's my take:
Currently, create_table can take either a PyIceberg Schema (pyiceberg.schema.Schema) or a PyArrow Schema (pyarrow.Schema).
Inside create_table, a PyArrow Schema is converted to a PyIceberg Schema, and then the field_ids are all reassigned:

# Convert a pyarrow.Schema to a pyiceberg Schema if needed
iceberg_schema = self._convert_schema_if_needed(schema)
# Reassign every field_id, regardless of the IDs the caller provided
fresh_schema = assign_fresh_schema_ids(iceberg_schema)
# Remap the partition spec and sort order onto the freshly assigned IDs
fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, iceberg_schema, fresh_schema)
fresh_sort_order = assign_fresh_sort_order_ids(sort_order, iceberg_schema, fresh_schema)

There are 3 ways a user can interact with the create_table API:

  • Passing in a PyIceberg Schema that already has field_ids assigned
  • Passing in a PyIceberg Schema that needs field_ids assignment
  • Passing in a PyArrow Schema

For the first case, we should not reassign the field_ids, but we do today.
For the second case, we should assign field_ids, as they are required.
For the third case, we should convert to a PyIceberg Schema and assign field_ids.

I think it'll be more user-friendly for the create_table function to accept either

  1. PyIceberg Schema with field_ids already assigned
  2. PyArrow Schema, in which case, we'll do the conversion and field_ids assignment

Today, I think the only way to craft a PyIceberg Schema that needs field_id assignment is either by creating it by hand or by using the _convert_schema_if_needed function.
We might need a way to check the validity of the field_ids in case the PyIceberg Schema was created by hand.
And we can always couple _convert_schema_if_needed with assign_fresh_schema_ids to ensure that the PyIceberg Schemas always have a valid field_id assignment.
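
For the validity check, a hypothetical helper could verify that field_ids are unique (a sketch only; a production version would also walk nested struct/list/map fields, e.g. with pyiceberg's schema visitors):

from pyiceberg.schema import Schema


def validate_field_ids(schema: Schema) -> None:
    # Hypothetical guard for hand-crafted Schemas; checks top-level fields only
    field_ids = [field.field_id for field in schema.fields]
    duplicates = {fid for fid in field_ids if field_ids.count(fid) > 1}
    if duplicates:
        raise ValueError(f"Duplicate field_ids in schema: {sorted(duplicates)}")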

I'm excited about this change. This might also help us streamline create_table with partition_spec and sort_order since there's been an issue with the field_id mismatch.

@sungwy
Collaborator Author

sungwy commented Nov 1, 2024

Hey @kevinjqliu thanks for the support! Glad to hear we are on the same page regarding this issue.

As with all API changes, I think we would want to deliberate on these options carefully, so I'll keep this issue open for a bit longer for others to opine before moving on to implementation.

@sungwy sungwy self-assigned this Nov 1, 2024
@Fokko
Contributor

Fokko commented Nov 4, 2024

Thanks for bringing this up!

I'm excited about this change. This might also help us streamline create_table with partition_spec and sort_order since there's been an issue with the field_id mismatch.

Indeed :) To provide some historical context: the idea was that PyIceberg was more of a library than an end-user tool, and therefore it would always sit behind a second layer (like having PyArrow in front of it).

In general, I think the philosophy should be that people don't have to worry about field IDs; this should be hidden away. In the case of table migration, that's an advanced situation where you don't want to re-assign the IDs. I can think of two options:

  • Having a flag to not assign fresh IDs
  • Ability to have a custom assigning class, so you can also do other ways of assigning the IDs (pre-order, post-order, etc.); see the sketch below

Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).
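
For the second option, assign_fresh_schema_ids accepts an optional next_id callable in recent pyiceberg versions, so a custom assignment strategy could be plugged in before calling create_table; a rough sketch:

from itertools import count

from pyiceberg.schema import Schema, assign_fresh_schema_ids
from pyiceberg.types import LongType, NestedField, StringType

schema = Schema(
    NestedField(field_id=10, name="id", field_type=LongType(), required=True),
    NestedField(field_id=20, name="name", field_type=StringType(), required=False),
)

# Plug in a custom strategy: here, fresh IDs starting at 100 in visit order
counter = count(start=100)
reassigned = assign_fresh_schema_ids(schema, next_id=lambda: next(counter))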

@sungwy
Collaborator Author

sungwy commented Nov 8, 2024

Hi @Fokko , thank you for the suggestions! I just put up this PR to introduce a new flag assign_fresh_ids on create_table and related methods.
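
Hypothetical usage of the proposed flag, reusing the names from the workflow sketch earlier in this issue (the exact flag name and default are up for review in the PR):

table = target_catalog.create_table(
    "db.events",
    schema=source_table.schema(),
    assign_fresh_ids=False,  # keep the provided field_ids instead of reassigning
)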

Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).

Yes, I think this is a great point to note. I ran into this issue when setting up the integration tests against the REST Catalog image, and I was a bit confused. I understand that it is up to the REST Catalog server to decide how it wants to create the table once the request is accepted, but I can't help finding it very counter-intuitive that the REST Catalog takes an ID-assigned Schema, PartitionSpec, and SortOrder and would still assign fresh IDs ☹️

Regardless, I'm excited to introduce this feature, so that we can support more migration workflows through PyIceberg - which is something many users have been hoping to do through Python applications 🚀

@kevinjqliu
Contributor

In general, I think the philosophy should be that people don't have to worry about field IDs; this should be hidden away.

+1

In the case of table migration, that's an advanced situation where you don't want to re-assign the IDs

We can have a function to explicitly assign IDs, such as assign_fresh_schema_ids.
If an Iceberg Schema is given, it will use its originally assigned IDs. If a PyArrow Schema is given, it will be converted to an Iceberg Schema and have IDs assigned.

@sungwy WDYT of this method instead of adding the assign_fresh_ids flag to the functions? I think it doesn't make sense for the create_table function to have a flag for schema ID assignment.

I guess the difference is: is there a case where I'd want to call create_table with an Iceberg Schema and have assign_fresh_ids=False? Can the user just call assign_fresh_schema_ids(iceberg_schema) instead?
We don't care about PyArrow Schemas, since they are always converted and assigned IDs.
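
Under that proposal, ID reassignment becomes an explicit, user-driven step rather than a create_table flag (a sketch assuming create_table preserves the IDs of an Iceberg Schema; catalog and iceberg_schema are placeholders):

from pyiceberg.schema import assign_fresh_schema_ids

# Default: the Iceberg Schema's field_ids are preserved as-is
table = catalog.create_table("db.tbl", schema=iceberg_schema)

# Wanting fresh IDs becomes an explicit call before table creation
table = catalog.create_table("db.tbl", schema=assign_fresh_schema_ids(iceberg_schema))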

@kevinjqliu
Contributor

Looks like in new_table_metadata the ID assignment is propagated to the partition_spec and sort_order as well:

https://github.com/apache/iceberg-python/pull/1304/files#diff-a7fbd8c6b0564308ca872ba479f0e240ff2c893d15776a14e52b91b48e9ac470R524-R528

@sungwy
Collaborator Author

sungwy commented Nov 13, 2024

If Iceberg Schema is given, it will use its original assigned ID. If PyArrow Schemas is given, it will be converted to Iceberg Schemas and have IDs assigned.

@kevinjqliu - Yes, I actually feel more inclined to drive this based on the type of the input schema. To me, the fact that the Iceberg Schema is required to have IDs assigned makes a very strong case for us to respect the given field IDs and propagate them to the Iceberg table metadata.

The case where some Iceberg REST Catalog servers do not respect the given field IDs and assign fresh ones actually feels more like an edge case to me, one that we should opine on and course-correct. Ideally, when the IDs are specified on the Schema, they should be respected by the server. Otherwise, the Schema supplied in a createTableRequest should not have field IDs assigned in the request.

This approach aligns with my initial proposal on the issue description ⬆️

I put up the PR with the assign_fresh_ids flag based on @Fokko 's concerns with the above approach. Could I ask for your reviews on this PR, and maybe we can decide on the approach there? #1304

@Fokko
Contributor

Fokko commented Nov 27, 2024

Thanks for the context yesterday, I was still noodling on it overnight.

If I understand correctly (and please also share the video of you and @adrianqin; I must have missed it during paternity leave), you're looking for a shallow clone of a table. In this situation, the metadata and manifests are recreated, but the existing data files are re-used to avoid unnecessary heavy lifting. As I mentioned yesterday, this is still a bit tricky since you might do a delete operation where a data file is dropped that's still referenced by the other table, but that's inherent to a shallow clone.

Instead of mangling the create_table operation, wouldn't it be better to have a clone operation? This would still hide the field IDs from the end-user and give a much cleaner API to work with.
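
A hypothetical shape for such an operation (nothing like this exists in PyIceberg today; all names are illustrative, and a real version would also carry over the partition spec and sort order):

from pyiceberg.catalog import Catalog
from pyiceberg.table import Table


def clone_table(catalog: Catalog, source: Table, identifier: str) -> Table:
    # Recreate metadata (preserving the source schema's field IDs) and manifests,
    # while re-using the existing data files
    cloned = catalog.create_table(identifier, schema=source.schema())
    cloned.add_files([task.file.file_path for task in source.scan().plan_files()])
    return cloned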

The case where some Iceberg REST Catalog servers do not respect the given field IDs and assign fresh ones actually feels more like an edge case to me, one that we should opine on and course-correct.

The REST catalog follows the philosophy that clients shouldn't have to worry about field IDs. The register table endpoint is designed to take an existing metadata file and re-use all the metadata, rather than re-assigning it. But it looks like there is interest in more flavors than just create and register. A similar question was also raised on Slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1732441495465079
