Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ExtensionType trait and CanonicalExtensionType enum #5822

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

mbrobbel
Copy link
Contributor

@mbrobbel mbrobbel commented May 31, 2024

Rationale for this change

It would be nice to better support reading and writing the Arrow canonical uuid and json extension types with the arrow and parquet crate i.e. mapping between the arrow extension type and the parquet logical uuid and json types.

What changes are included in this PR?

This adds an ExtensionType trait, some impls for canonical extension types and a CanonicalExtensionType enum for canonical extension types.

Are there any user-facing changes?

Users can now annotate their fields with extension types, and for uuid and json they are propagated via the arrow writer to map to the parquet uuid and json logical types.

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels May 31, 2024
@kylebarron
Copy link
Contributor

Maybe ExtensionType could be a trait to be externally implementable and not limited to canonical extension types?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, and seems like an unobtrusive way to provide better ergonomics for extension types.

That being said I've limited exposure to them so getting some broader perspectives might be valuable, perhaps on the mailing list or something?

@mbrobbel
Copy link
Contributor Author

I haven't had time to work on this, but I'm planning to pick this up later.

@alamb
Copy link
Contributor

alamb commented Jul 1, 2024

Thanks @mbrobbel -- marking this PR as draft as I think it still has planned but not yet completed work

@alamb alamb marked this pull request as draft July 1, 2024 19:17
@aykut-bozkurt
Copy link
Contributor

Thanks for the PR. This seems very useful to support not yet mapped logical types. e.g. json

@mbrobbel
Copy link
Contributor Author

mbrobbel commented Sep 26, 2024

I updated the PR to define a trait for extension types instead. Ready for another round of feedback.
Edit: just realized I need to change some trait methods to make it work for other extension types.

@mbrobbel mbrobbel force-pushed the parquet-uuid-schema branch from 6b883c5 to e35630a Compare January 17, 2025 22:49
@mbrobbel mbrobbel changed the title Add ExtensionType for uuid and map to parquet logical type Add ExtensionType trait and CanonicalExtensionType enum Jan 17, 2025
@mbrobbel mbrobbel marked this pull request as ready for review January 20, 2025 18:59
@mbrobbel
Copy link
Contributor Author

This is ready for another round of reviews.

I updated the trait and changed some field methods, added implementations for the current set of canonical extension types (behind a feature flag) and added more tests and docs. Maybe this PR is too big now, but I don't mind splitting it up if needed.

Roundtripping through parquet is not implemented in this PR.

@alamb
Copy link
Contributor

alamb commented Jan 21, 2025

Thank you @mbrobbel -- I plan to review this carefully tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 -- thank you SO much @mbrobbel -- I looked at the structure of this PR and I really like it. It is really nicely done. I have some minor comments, but really I found this PR really well structured, commented and tested 🏅

This also implements the consensus reached on #4472 as I understand it

I have some API design questions, but I also think this is good enough to merge as is and I would be happy to maintain the code as is.

If there are no objections to this approach from other reviewers in theory I will find the time to review the tests a bit more closely before merging, but the ones I have looked at so far looks 👍 👍 👍

@westonpace @kylebarron @tustvold @yukkit perhap you would have some time to review / look over this PR as you were quite active on #4472

@emilk, @danking @jleibs, @crepererum and @findepi as you seem to have mentioned this PR in the past

///
/// [`Field`]: crate::Field
/// [`Field::metadata`]: crate::Field::metadata
fn serialize_metadata(&self) -> Option<String>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ever possible to have an error serializialzing the metadata? If so perhaps this should return Result<Option<String>>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably Result<String> is enough, I think None and "" in this case doesn't matter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a practical situation where serialization of the extension type metadata would error, but I'm fine with adding a Result here.

The Option is required to support extension types without metadata (they should return None here). Note that this is different from "", because in that case the extension metadata key is added to the field custom metadata, whereas for None this key is skipped.

So with support for serialization errors this would return: Option<Result<String, ArrowError>>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because in that case the extension metadata key is added to the field custom metadata, whereas for None this key is skipped.

I forget what version of Arrow C++ I fixed this in, but old versions or Arrow C++ (maybe older than ~a year) will segfault if ARROW:extension:metadata is missing when importing via the C Data interface, so it might be worth always serializing an empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://arrow.apache.org/docs/format/Columnar.html#extension-types:

Extension types may or may not use the 'ARROW:extension:metadata' field.

Which I interpret as; it should also work when this key is missing in the field metadata for extension types that don't require metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's your call

Probably a better way to put that would have been: I don't have a strong feeling about it, but it did take me quite a long time to figure out why geoarrow-c was crashing causing pyarrow to crash 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can imagine. Thanks for the suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a note in the comments and suggest for maximum compatibility extension types should always return Some("")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be in conflict with the docs for deserialize_metadata:

An extension type that defines no metadata should expect None for the serialized metadata and return Ok(()).

This function should return an error when unexpected metadata is set (for extension types without metadata)

We could add a note as you suggested for serialize_metadata and then modify the comments in deserialize_metadata to be suggestions only?

The Uuid canonical extension type requires no metadata, so if we change the recommendation here we should also change that implementation maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least Arrow C++ and DuckDB will export ARROW:extension:metadata as an empty string if there are no parameters, so something will have to handle that (if they are interested in interacting with other Arrow implementations). In GeoArrow I personally accept "{}", "" or missing metadata.

///
/// # Example
///
/// The example below demonstrates how to implement this trait for a `Uuid`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a really nice example and illustrates how this API works

///
/// If an extension type defines no metadata it should use `()` to indicate
/// this.
type Metadata;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what is the usecase for the Metadata indirection?

For instance, the example above seems like it would be simpler if it simply serialized / deserialized UuidVersion directly

I think that would mean the API would look like

trait ExtensionType {
  // serialize self to a string
  fn serialize(&self) -> Option<String>;
  // deserialize into self
  fn deserialize(metadata: Option<&str>) -> Result<Self, ArrowError>;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Json, serde_json::Value is the metadata instead of String, with your design I think we are not able to store metadata other than String

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think any metadata needs to be serialized to a string (which is the same as what the PR does)

    fn serialize_metadata(&self) -> Option<String>;

I am just wondering why have a separate Rust type for Metadata

For eample, if I were doing this my first inclination would be to do

impl ExtensionType for MyAwesomeType {
    ...
    type Metadata = Self;
    ....
}

So then I could just serialize/deserialize instances of my awesome type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The only benefit is that we can have a separate clean struct IF the metadata becomes pretty complex. For usual case, simply se/de without type Metadata is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work for all extension types. For example the FixedShapeTensor and VariableShapeTensor canonical extension types can't be constructed from just the metadata, they also need the field datatype (which explains the ExtensionType::try_new signature).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. Maybe we can add some comments explaining the rationale a bit more.

Or maybe we can default the Metadata type to Self

type Metadata = Self;

To be clear we can do this as a follow on PR (or never). I don't think it is necessary for this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately associated type defaults are not stabilized yet. I added a note in b78e692.


//! Extension types.

#[cfg(feature = "canonical-extension-types")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for making this behind a feature flag

It would be nice to add this to the documentation:

Perhaps we could also note it is "experimental" and say we reserve the right to make breaking changes between even minor version releases -- that would give us flexibilty to iterate in the short term if necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

069642f

Perhaps we could also note it is "experimental"

Do you want to add an experimental feature to the arrow crate (same as in the parquet crate)?

arrow-rs/parquet/Cargo.toml

Lines 112 to 113 in 1664214

# Experimental, unstable functionality primarily used for testing
experimental = []

Or just a note in the module docs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a note in the module docs is good enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a warning in 5fec56d.

/// Canonical extension types.
///
/// <https://arrow.apache.org/docs/format/CanonicalExtensions.html#format-canonical-extensions>
#[non_exhaustive]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for non exhaustive so we can add new extension types without breaking API changes

@emilk
Copy link
Contributor

emilk commented Jan 22, 2025

This looks great to me! I really like this approach

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks pretty nice!

Copy link
Contributor

@kylebarron kylebarron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot to commit changes

Comment on lines +2253 to +2258
// TODO: roundtrip
// let arrow_schema = parquet_to_arrow_schema(&parquet_schema, None)?;
// assert_eq!(
// arrow_schema.field(0).try_extension_type::<Json>()?,
// Json::default()
// );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this left for a future PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this requires a bit more work to propagate correctly.

///
/// // We just return a reference to the Uuid version.
/// fn metadata(&self) -> &Self::Metadata {
/// &self.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first go at implementing this trait I used:

pub struct PointType(CoordType, Dimension);

impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";

    type Metadata = (CoordType, Dimension);

    fn metadata(&self) -> &Self::Metadata {
        &(self.0, self.1)
    }
}

But that won't work for this because cannot return reference to temporary value.

So essentially type Metadata needs to be the same as whatever is defined inside the extension type struct, and that needs to be a single item, so that it can return Self::Metadata without creating any new items?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current signature requires that. We could change it to return Self::Metadata?

The pattern I used for the canonical extension types with metadata applied to your example:

pub struct PointType(PointTypeMetadata);

pub struct PointTypeMetadata(CoordType, Dimension);

impl ExtensionType for PointType {
    const NAME: &'static str = "geoarrow.point";

    type Metadata = PointTypeMetadata;

    fn metadata(&self) -> &Self::Metadata {
        &self.0
    }
}

///
/// [`Field`]: crate::Field
/// [`Field::metadata`]: crate::Field::metadata
fn serialize_metadata(&self) -> Option<String>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting because it requires the type to contain all information to construct its Arrow extension metadata. This is not how I've implemented it so far in geoarrow-rs. For example, I define a Point type as parameterized only by its "coordinate type" (i.e. FixedSizeList coordinates or Struct[x: Float, y: Float] coordinates) and its dimension (XY, XYZ, and so on). Separately, I store more metadata, and each array type stores both pieces of data.

So the data type alone is not enough information to serialize the full GeoArrow metadata, which also needs the Coordinate Reference System (CRS). It's a good question whether the CRS should be a part of the type here. Perhaps it should, but then we need to support CRS equality, which is quite tricky to do, so a single logical CRS can be represented in different ways.

cc @paleolimbot , this might be a question for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is great, I'll take a better look at this tomorrow and figure out if the current definition works with GeoArrow.
Note that in the spec extension types are defined for Field, not for Array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider the CRS a property of the type in geoarrow-c and geoarrow-python (which is also how it is implemented in the draft Parquet LogicalType, GeoParquet, and the draft Iceberg spec). My reading of this is that it's compatible with GeoArrow (and will be great!). (A Python implementation is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-types/src/geoarrow/types/type_pyarrow.py ).

It's true that CRS equality is thorny (generally it's expressed as a probability)...I implement this as part of the type and implement == on the type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like all the pieces are there (in geoarrow-rs), so I guess it will be straightforward to add ExtensionType implementations that can be used with the added extension type related methods of Field.

Copy link

@Tpt Tpt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just passing by and had a lot of fun reading this code (thank you)!

Please just ignore my comments if you feel so.

impl ExtensionType for Bool8 {
const NAME: &'static str = "arrow.bool8";

type Metadata = &'static str;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe bad idea: use () here. This way it's not possible to encode bad metadata value (any string that is not empty)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the proposed definition of the trait, () would indicate no metadata and results in the metadata key to be in unset in the field custom metadata map. Note that there is no actual storage for the metadata in this extension type. Both serialization and deserialization make sure the metadata is always an empty string.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Missed that! Thanks!

"" => Ok(""),
_ => Err(ArrowError::InvalidArgumentError(ERR.to_owned())),
},
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly shorter:

Suggested change
)
if metadata.map_or(false, str::is_empty) {
Ok("")
} else {
Err(ArrowError::InvalidArgumentError("Bool8 extension type expects an empty string as metadata".into()))
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

///
/// This type does not have any parameters.
///
/// Metadata is either an empty string or a JSON string with an empty
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on implementation I guess it's not "an empty string" but "a JSON string with an empty string" (ie "\"\"" and not "")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! 29a94cb

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe bad idea: instead of an Option<Map<...>> what about just using a is_empty_object: bool field or something like this to make invalid states unrepresentable?

Also, you might even hack something like this to only deserialize the empty object:

#[derive(Deserialize)]
struct Empty {}

let is_empty_struct = serde_json::from_str<Empty>(value).is_ok()

and then use the Option<Empty> type in JsonMetadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider doing this. The spec states (https://arrow.apache.org/docs/format/CanonicalExtensions.html#json):

Description of the serialization:
Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.

So I figured, I'll support the future use-case, however, if/when in the future fields are added here it would actually be better to use a strongly typed definition. I like your suggestion of using Empty here and since the metadata content currently is not pub, if fields are added in the future, we can just rename and modify the definition of the inner metadata type without introducing a breaking change.

c6f0443

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! It might be possible to also use a similar approach for the Bool8 metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no metadata storage for Bool8, and the difference is that there is no note about future additions to the metadata serialization:

Description of the serialization:
Metadata is an empty string.

So, if there are breaking changes in the spec for this type, it would have to be a breaking change here too anyway.

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

I think we should leave this PR open at least through the weekend to gather additional feedback. I think given the size/scope of it perhaps we can plan to merge it after I cut the release candidate for the next arrow-rs release (mid-late next week) if there are no other concerns / comments.

THank you again @mbrobbel for pushing this along

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an ExtensionType to DataType enum
9 participants