-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ExtensionType
trait and CanonicalExtensionType
enum
#5822
base: main
Are you sure you want to change the base?
Conversation
Maybe ExtensionType could be a trait to be externally implementable and not limited to canonical extension types? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me, and seems like an unobtrusive way to provide better ergonomics for extension types.
That being said I've limited exposure to them so getting some broader perspectives might be valuable, perhaps on the mailing list or something?
I haven't had time to work on this, but I'm planning to pick this up later. |
Thanks @mbrobbel -- marking this PR as draft as I think it still has planned but not yet completed work |
Thanks for the PR. This seems very useful to support not yet mapped logical types. e.g. json |
|
6b883c5
to
e35630a
Compare
ExtensionType
for uuid
and map to parquet logical typeExtensionType
trait and CanonicalExtensionType
enum
…onst was added in 1.81
This is ready for another round of reviews. I updated the trait and changed some field methods, added implementations for the current set of canonical extension types (behind a feature flag) and added more tests and docs. Maybe this PR is too big now, but I don't mind splitting it up if needed. Roundtripping through parquet is not implemented in this PR. |
Thank you @mbrobbel -- I plan to review this carefully tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮 -- thank you SO much @mbrobbel -- I looked at the structure of this PR and I really like it. It is really nicely done. I have some minor comments, but really I found this PR really well structured, commented and tested 🏅
This also implements the consensus reached on #4472 as I understand it
I have some API design questions, but I also think this is good enough to merge as is and I would be happy to maintain the code as is.
If there are no objections to this approach from other reviewers in theory I will find the time to review the tests a bit more closely before merging, but the ones I have looked at so far looks 👍 👍 👍
@westonpace @kylebarron @tustvold @yukkit perhap you would have some time to review / look over this PR as you were quite active on #4472
@emilk, @danking @jleibs, @crepererum and @findepi as you seem to have mentioned this PR in the past
/// | ||
/// [`Field`]: crate::Field | ||
/// [`Field::metadata`]: crate::Field::metadata | ||
fn serialize_metadata(&self) -> Option<String>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ever possible to have an error serializialzing the metadata? If so perhaps this should return Result<Option<String>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably Result<String>
is enough, I think None and "" in this case doesn't matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a practical situation where serialization of the extension type metadata would error, but I'm fine with adding a Result
here.
The Option
is required to support extension types without metadata (they should return None
here). Note that this is different from ""
, because in that case the extension metadata key is added to the field custom metadata, whereas for None
this key is skipped.
So with support for serialization errors this would return: Option<Result<String, ArrowError>>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because in that case the extension metadata key is added to the field custom metadata, whereas for None this key is skipped.
I forget what version of Arrow C++ I fixed this in, but old versions or Arrow C++ (maybe older than ~a year) will segfault if ARROW:extension:metadata
is missing when importing via the C Data interface, so it might be worth always serializing an empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per https://arrow.apache.org/docs/format/Columnar.html#extension-types:
Extension types may or may not use the 'ARROW:extension:metadata' field.
Which I interpret as; it should also work when this key is missing in the field metadata for extension types that don't require metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's your call
Probably a better way to put that would have been: I don't have a strong feeling about it, but it did take me quite a long time to figure out why geoarrow-c was crashing causing pyarrow to crash 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can imagine. Thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add a note in the comments and suggest for maximum compatibility extension types should always return Some("")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be in conflict with the docs for deserialize_metadata
:
An extension type that defines no metadata should expect
None
for the serialized metadata and returnOk(())
.
This function should return an error when unexpected metadata is set (for extension types without metadata)
We could add a note as you suggested for serialize_metadata
and then modify the comments in deserialize_metadata
to be suggestions only?
The Uuid
canonical extension type requires no metadata, so if we change the recommendation here we should also change that implementation maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least Arrow C++ and DuckDB will export ARROW:extension:metadata
as an empty string if there are no parameters, so something will have to handle that (if they are interested in interacting with other Arrow implementations). In GeoArrow I personally accept "{}"
, ""
or missing metadata.
/// | ||
/// # Example | ||
/// | ||
/// The example below demonstrates how to implement this trait for a `Uuid` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a really nice example and illustrates how this API works
/// | ||
/// If an extension type defines no metadata it should use `()` to indicate | ||
/// this. | ||
type Metadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what is the usecase for the Metadata indirection?
For instance, the example above seems like it would be simpler if it simply serialized / deserialized UuidVersion
directly
I think that would mean the API would look like
trait ExtensionType {
// serialize self to a string
fn serialize(&self) -> Option<String>;
// deserialize into self
fn deserialize(metadata: Option<&str>) -> Result<Self, ArrowError>;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Json, serde_json::Value
is the metadata instead of String, with your design I think we are not able to store metadata other than String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think any metadata needs to be serialized to a string (which is the same as what the PR does)
fn serialize_metadata(&self) -> Option<String>;
I am just wondering why have a separate Rust type for Metadata
For eample, if I were doing this my first inclination would be to do
impl ExtensionType for MyAwesomeType {
...
type Metadata = Self;
....
}
So then I could just serialize/deserialize instances of my awesome type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The only benefit is that we can have a separate clean struct IF the metadata becomes pretty complex. For usual case, simply se/de without type Metadata
is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work for all extension types. For example the FixedShapeTensor
and VariableShapeTensor
canonical extension types can't be constructed from just the metadata, they also need the field datatype (which explains the ExtensionType::try_new
signature).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. Maybe we can add some comments explaining the rationale a bit more.
Or maybe we can default the Metadata
type to Self
type Metadata = Self;
To be clear we can do this as a follow on PR (or never). I don't think it is necessary for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately associated type defaults are not stabilized yet. I added a note in b78e692.
arrow-schema/src/extension/mod.rs
Outdated
|
||
//! Extension types. | ||
|
||
#[cfg(feature = "canonical-extension-types")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for making this behind a feature flag
It would be nice to add this to the documentation:
Perhaps we could also note it is "experimental" and say we reserve the right to make breaking changes between even minor version releases -- that would give us flexibilty to iterate in the short term if necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could also note it is "experimental"
Do you want to add an experimental
feature to the arrow
crate (same as in the parquet
crate)?
Lines 112 to 113 in 1664214
# Experimental, unstable functionality primarily used for testing | |
experimental = [] |
Or just a note in the module docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding a note in the module docs is good enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a warning in 5fec56d.
/// Canonical extension types. | ||
/// | ||
/// <https://arrow.apache.org/docs/format/CanonicalExtensions.html#format-canonical-extensions> | ||
#[non_exhaustive] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for non exhaustive so we can add new extension types without breaking API changes
This looks great to me! I really like this approach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks pretty nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops forgot to commit changes
// TODO: roundtrip | ||
// let arrow_schema = parquet_to_arrow_schema(&parquet_schema, None)?; | ||
// assert_eq!( | ||
// arrow_schema.field(0).try_extension_type::<Json>()?, | ||
// Json::default() | ||
// ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this left for a future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this requires a bit more work to propagate correctly.
/// | ||
/// // We just return a reference to the Uuid version. | ||
/// fn metadata(&self) -> &Self::Metadata { | ||
/// &self.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first go at implementing this trait I used:
pub struct PointType(CoordType, Dimension);
impl ExtensionType for PointType {
const NAME: &'static str = "geoarrow.point";
type Metadata = (CoordType, Dimension);
fn metadata(&self) -> &Self::Metadata {
&(self.0, self.1)
}
}
But that won't work for this because cannot return reference to temporary value
.
So essentially type Metadata
needs to be the same as whatever is defined inside the extension type struct, and that needs to be a single item, so that it can return Self::Metadata
without creating any new items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the current signature requires that. We could change it to return Self::Metadata
?
The pattern I used for the canonical extension types with metadata applied to your example:
pub struct PointType(PointTypeMetadata);
pub struct PointTypeMetadata(CoordType, Dimension);
impl ExtensionType for PointType {
const NAME: &'static str = "geoarrow.point";
type Metadata = PointTypeMetadata;
fn metadata(&self) -> &Self::Metadata {
&self.0
}
}
/// | ||
/// [`Field`]: crate::Field | ||
/// [`Field::metadata`]: crate::Field::metadata | ||
fn serialize_metadata(&self) -> Option<String>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting because it requires the type to contain all information to construct its Arrow extension metadata. This is not how I've implemented it so far in geoarrow-rs. For example, I define a Point
type as parameterized only by its "coordinate type" (i.e. FixedSizeList
coordinates or Struct[x: Float, y: Float]
coordinates) and its dimension (XY
, XYZ
, and so on). Separately, I store more metadata, and each array type stores both pieces of data.
So the data type alone is not enough information to serialize the full GeoArrow metadata, which also needs the Coordinate Reference System (CRS). It's a good question whether the CRS should be a part of the type here. Perhaps it should, but then we need to support CRS equality, which is quite tricky to do, so a single logical CRS can be represented in different ways.
cc @paleolimbot , this might be a question for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is great, I'll take a better look at this tomorrow and figure out if the current definition works with GeoArrow.
Note that in the spec extension types are defined for Field, not for Array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider the CRS a property of the type in geoarrow-c and geoarrow-python (which is also how it is implemented in the draft Parquet LogicalType, GeoParquet, and the draft Iceberg spec). My reading of this is that it's compatible with GeoArrow (and will be great!). (A Python implementation is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-types/src/geoarrow/types/type_pyarrow.py ).
It's true that CRS equality is thorny (generally it's expressed as a probability)...I implement this as part of the type and implement ==
on the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like all the pieces are there (in geoarrow-rs
), so I guess it will be straightforward to add ExtensionType
implementations that can be used with the added extension type related methods of Field
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just passing by and had a lot of fun reading this code (thank you)!
Please just ignore my comments if you feel so.
impl ExtensionType for Bool8 { | ||
const NAME: &'static str = "arrow.bool8"; | ||
|
||
type Metadata = &'static str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe bad idea: use ()
here. This way it's not possible to encode bad metadata value (any string that is not empty)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the proposed definition of the trait, ()
would indicate no metadata and results in the metadata key to be in unset in the field custom metadata map. Note that there is no actual storage for the metadata in this extension type. Both serialization and deserialization make sure the metadata is always an empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! Missed that! Thanks!
"" => Ok(""), | ||
_ => Err(ArrowError::InvalidArgumentError(ERR.to_owned())), | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slightly shorter:
) | |
if metadata.map_or(false, str::is_empty) { | |
Ok("") | |
} else { | |
Err(ArrowError::InvalidArgumentError("Bool8 extension type expects an empty string as metadata".into())) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// | ||
/// This type does not have any parameters. | ||
/// | ||
/// Metadata is either an empty string or a JSON string with an empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on implementation I guess it's not "an empty string" but "a JSON string with an empty string" (ie "\"\""
and not ""
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! 29a94cb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe bad idea: instead of an Option<Map<...>>
what about just using a is_empty_object: bool
field or something like this to make invalid states unrepresentable?
Also, you might even hack something like this to only deserialize the empty object:
#[derive(Deserialize)]
struct Empty {}
let is_empty_struct = serde_json::from_str<Empty>(value).is_ok()
and then use the Option<Empty>
type in JsonMetadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider doing this. The spec states (https://arrow.apache.org/docs/format/CanonicalExtensions.html#json):
Description of the serialization:
Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.
So I figured, I'll support the future use-case, however, if/when in the future fields are added here it would actually be better to use a strongly typed definition. I like your suggestion of using Empty
here and since the metadata content currently is not pub
, if fields are added in the future, we can just rename and modify the definition of the inner metadata type without introducing a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! It might be possible to also use a similar approach for the Bool8
metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently no metadata storage for Bool8
, and the difference is that there is no note about future additions to the metadata serialization:
Description of the serialization:
Metadata is an empty string.
So, if there are breaking changes in the spec for this type, it would have to be a breaking change here too anyway.
I think we should leave this PR open at least through the weekend to gather additional feedback. I think given the size/scope of it perhaps we can plan to merge it after I cut the release candidate for the next arrow-rs release (mid-late next week) if there are no other concerns / comments. THank you again @mbrobbel for pushing this along |
Rationale for this change
It would be nice to better support reading and writing the Arrow canonical
uuid
andjson
extension types with the arrow and parquet crate i.e. mapping between the arrow extension type and the parquet logicaluuid
andjson
types.What changes are included in this PR?
This adds an
ExtensionType
trait, some impls for canonical extension types and aCanonicalExtensionType
enum for canonical extension types.Are there any user-facing changes?
Users can now annotate their fields with extension types, and for
uuid
andjson
they are propagated via the arrow writer to map to the parquetuuid
andjson
logical types.