# Enhance `catalog.create_table` API to enable creation of table with matching field_ids to provided Schema #1284
Thanks for the writeup!
+1 to this. Here's my take: `iceberg-python/pyiceberg/catalog/rest.py`, lines 573 to 576 at `d559e53`.
There are 3 ways a user can interact with the
For the first case, we should not reassign the field_ids, but we do today. I think it'll be more user-friendly for the
Today, I think the only way to craft a PyIceberg Schema that needs field_ids assignment is either through creating it by hand or using the

I'm excited about this change. This might also help us streamline
Hey @kevinjqliu, thanks for the support! Glad to hear we are on the same page regarding this issue. As with all API changes, I think we would want to deliberate on these options carefully, so I'll keep this issue open for a bit longer for others to opine before moving on to implementation.
Thanks for bringing this up!
Indeed :) To provide some historical context: the idea was that PyIceberg was more of a library than an end-user tool, and therefore it was always behind a second layer (like having PyArrow in front of it). In general, I think the philosophy should be that people don't have to worry about field IDs, and this should be hidden away. In the case of the table migration, that's an advanced situation where you don't want to re-assign the IDs. I can think of two options:
Keep in mind that the REST catalog also might re-assign IDs (I think the one we use in the integration tests also does this).
Hi @Fokko , thank you for the suggestions! I just put up this PR to introduce a new flag
Yes, I think this is a great point to note. I ran into this issue when setting up the integration tests against the REST Catalog image, and I was a bit confused. I understand that it is up to the REST Catalog server to decide how it wants to create the table once the request is accepted, but I can't help finding it very counter-intuitive that the REST Catalog takes an ID-assigned Schema, PartitionSpec and SortOrder and still assigns fresh IDs.

Regardless, I'm excited to introduce this feature, so that we can support more migration workflows through PyIceberg, which is something many users have been hoping to do through Python applications 🚀
+1
We can have a function to explicitly assign IDs, such as

@sungwy WDYT of this method instead of adding the

I guess the difference is, is there a case where I want to call
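An explicit ID-assignment helper along these lines could look like the following sketch. The `Field`/`Struct` classes are simplified stand-ins for pyiceberg's `NestedField`/`StructType` (not the real API), and the sequential pre-order numbering is an assumption for illustration rather than pyiceberg's exact strategy:

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Tuple, Union

@dataclass(frozen=True)
class Field:
    # Simplified stand-in for pyiceberg's NestedField (hypothetical, not the real API)
    name: str
    field_id: int
    type: Union["Struct", str]  # a plain string stands in for primitive types

@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]

def assign_fresh_ids(struct: Struct) -> Struct:
    """Rebuild the structure, assigning sequential field IDs in pre-order."""
    counter = count(1)

    def visit(s: Struct) -> Struct:
        new_fields = []
        for f in s.fields:
            new_id = next(counter)  # a field gets its ID before its children
            new_type = visit(f.type) if isinstance(f.type, Struct) else f.type
            new_fields.append(replace(f, field_id=new_id, type=new_type))
        return Struct(tuple(new_fields))

    return visit(struct)

# Example: stale, non-sequential IDs are replaced deterministically
schema = Struct((
    Field("id", 100, "long"),
    Field("point", 105, Struct((Field("x", 9, "int"), Field("y", 3, "int")))),
))
fresh = assign_fresh_ids(schema)
```

Making the step explicit like this would let callers decide when re-assignment happens, instead of `create_table` doing it unconditionally.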
looks like in
@kevinjqliu - Yes, I actually feel more inclined to drive this based on the type of the input schema. To me, the fact that the Iceberg Schema is required to have IDs assigned makes a very strong case for respecting the given field IDs and propagating them to the Iceberg table metadata. The case where some Iceberg REST Catalog servers do not respect the given field IDs and assign fresh ones actually feels more like an edge case, one that we should opine on and course-correct. Ideally, when the IDs are specified on the Schema, they should be respected by the server; otherwise, the Schema supplied in a createTableRequest should not have field IDs assigned in the request. This approach aligns with my initial proposal in the issue description ⬆️ I put up the PR with the
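A rough sketch of that type-driven dispatch, with hypothetical stand-in classes in place of the real `pyiceberg.schema.Schema` and `pyarrow.Schema`:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IcebergSchema:
    # Stand-in for pyiceberg.schema.Schema: field IDs are mandatory
    fields: List[Tuple[int, str]]  # (field_id, name)

@dataclass
class ArrowSchema:
    # Stand-in for pyarrow.Schema: carries no field IDs at all
    names: List[str]

def resolve_field_ids(schema) -> List[Tuple[int, str]]:
    if isinstance(schema, IcebergSchema):
        # IDs were deliberately supplied: propagate them to the table metadata.
        return list(schema.fields)
    if isinstance(schema, ArrowSchema):
        # No IDs to respect: mint fresh sequential ones.
        return [(i, name) for i, name in enumerate(schema.names, start=1)]
    raise TypeError(f"Unsupported schema type: {type(schema).__name__}")

kept = resolve_field_ids(IcebergSchema([(7, "id"), (3, "data")]))
minted = resolve_field_ids(ArrowSchema(["id", "data"]))
```

The appeal of this design is that no extra flag is needed: the input type alone signals the caller's intent.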
Thanks for the context yesterday; I was still noodling on it overnight. If I understand correctly (and please also share the video of you and @adrianqin; I must have missed it during the paternity leave), you're looking for a shallow clone of a table. In this situation, the metadata and manifests are recreated, but the existing data-files are re-used to avoid unnecessary heavy lifting. As I mentioned yesterday, this is still a bit tricky since you might do a delete operation where a data-file is dropped that's still referenced in the other table, but that's inherent to a shallow clone. Instead of mangling the
The REST catalog follows the philosophy that clients shouldn't have to worry about field IDs. The register-table operation is designed to take an existing file and re-use all the metadata, rather than re-assigning it. But it looks like there is interest in more flavors than just
## Feature Request / Improvement
Currently, `create_table` takes a `pyiceberg.schema.Schema` or a `pyarrow.Schema`. The API ignores the `field_id`s provided in the schema and issues a set of new ones.

Not only does this cause confusion for users, because `pyiceberg.schema.Schema` requires `field_id`s, but it also means that a table cannot be created with the guarantee that its schema will match the field IDs of the provided schema. This prevents the API from being used for table migrations (discussion on example use case), where a user would want to take the following course of steps:

1. Use `catalog.load_table` to load the `pyiceberg.table.Table` of an existing Iceberg table.
2. Get the `pyiceberg.schema.Schema` of the loaded table.
3. Call `catalog.create_table` using the existing table's schema.
4. Use `add_files` (this is not possible yet, but there is a discussion that would allow `add_files` to work on a table whose schema has matching field_ids).

The above procedure will not work unless we introduce an enhancement to `create_table` that enables creation of a new table that matches the field_ids of the provided Schema.

One way of addressing this issue is to have two ways of representing the Iceberg table's Schema:
- A `pyiceberg.schema.Schema` with newly assigned field_ids to create the table.
- The provided `pyiceberg.schema.Schema`, with its field_ids kept as-is, to create the table.

I discuss a few ideas for achieving this below, each with its own pros and cons:
### Create a subclass of `pyiceberg.schema.Schema` without `field_id`

This sounds like the best approach, but once explored, we quickly realize that it may be impossible. The main challenge is that `pyiceberg.schema.Schema` describes its fields using `NestedField`s, which are nested structures of pydantic BaseModels with `field_id` as a required field. So there isn't a way to create a subclass of `pyiceberg.schema.Schema` without `field_id`.

### Create a variant class of `pyiceberg.schema.Schema` without `field_id`

This is a bit different from the above approach, and requires us to make variant classes of `pyiceberg.schema.Schema` that are not subclassed from it. This is not ideal, because we would have to maintain field_id-less copies of `NestedField`, `StructType`, `MapType`, `ListType` and `Schema`, and create methods to build a field_id'ed Schema from its field_id-less variant. It is possible, but it would be hard and messy to manage.

### We could make `field_id` an optional attribute of `NestedField` and field_id'd Iceberg Types

This would allow us to create a `pyiceberg.schema.Schema` with and without field_ids. However, it creates a new opportunity for issues to be introduced into PyIceberg that have so far been prevented by `NestedField`'s attributes directly matching those of the REST Catalog spec. With `field_id` as an optional field, we would need to introduce many more validations across our code base to ensure that the field_id is set on all the nested fields within a schema before using it.

### Keep the `field_id`s when using `pyiceberg.schema.Schema`, but generate new ones when using `pyarrow.Schema`
I think this may be the safest approach, and it will be user-friendly: `pyiceberg.schema.Schema` requires `field_id`s, and `pyarrow.Schema` does not. `pyarrow.Schema` is also just a completely different class, so users do not expect the `field_id`s within a `pyarrow.Schema` to be kept in the `pyiceberg.schema.Schema` (although this is an enhancement we could introduce in the future). When and if we introduce new schema representations as alternate inputs to the API, we can evaluate case by case whether it makes sense to keep the field IDs or assign new ones.

I am personally in favor of the last approach: revert to keeping the field_ids of the provided `pyiceberg.schema.Schema` if the input is of that type, and if the input schema is a `pyarrow.Schema`, create a new Schema with freshly assigned IDs. The behavior of the API will then feel more consistent with how our users use it, and with how they expect the field_ids of the created table's Schema to look in different scenarios.

I'd love to hear the thoughts of our community members on this topic before jumping into an implementation.
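To make concrete why ID preservation matters for the migration workflow above: `add_files`-style operations need every column of the new table to resolve to the same field ID as in the source table. The following check is purely illustrative (the dict-based schemas and the helper are inventions for this sketch, not pyiceberg API):

```python
from typing import Dict, List

def field_id_mismatches(source: Dict[str, int], created: Dict[str, int]) -> List[str]:
    """Names of columns whose field ID in the created table differs from the source."""
    return sorted(name for name, fid in source.items() if created.get(name) != fid)

# Source table's column-name -> field-ID mapping
source = {"id": 1, "ts": 2, "payload": 3}

# Proposed behavior: create_table keeps the provided IDs, so data files written
# against the source schema remain addressable in the new table.
preserved = {"id": 1, "ts": 2, "payload": 3}

# Today's behavior: fresh assignment may order IDs differently, silently
# breaking the correspondence between old data files and the new schema.
reassigned = {"id": 1, "ts": 3, "payload": 2}

ok = field_id_mismatches(source, preserved)       # []
broken = field_id_mismatches(source, reassigned)  # ['payload', 'ts']
```

Any non-empty result means existing data files would be misread through the new table's schema, which is exactly what the proposed enhancement prevents.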