diff --git a/LogicalTypes.md b/LogicalTypes.md
index b55a90884..bcc20c9da 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -767,6 +767,72 @@ optional group my_map (MAP_KEY_VALUE) {
 }
 ```
 
+## EXTENSION
+
+Extension types allow the Parquet type system to be open-ended. An extension
+type can be used to signal a third-party type that has no equivalent in the
+core Parquet type system.
+
+Extension types will typically be specified by third-party communities, or
+be vendor-specific. An extension type specification will typically contain
+the following elements:
+
+1. The extension type must be identified by a dotted name with the first name
+   component clearly denoting the authority that defined the type. The
+   `parquet.` namespace is reserved for use by the Parquet community and
+   must not be used for third-party extension types.
+
+2. The extension type must define which parameters it takes, if any. It must
+   define a binary serialization to store those parameters in the Parquet schema.
+   It is recommended (but not required) that the serialization is a UTF-8 encoding
+   of a JSON object.
+
+3. The extension type must define which kind of node it annotates: leaf
+   or non-leaf. If non-leaf, the allowed subtree shape must be defined.
+
+4. If the extension type annotates leaf nodes, it must define the allowed
+   physical type(s).
+
+5. If the extension type annotates leaf nodes, it may also define its sort
+   order (see the `ColumnOrder` definition in the Thrift format). If it does
+   not, then the extension type is unordered.
+
+### Reading extension types
+
+An extension type is identified by its name. A reader will typically have
+a collection of extension types that it knows about; it may also offer a way
+for the user to register additional extension types.
+
+When a reader encounters an extension type in a Parquet schema, it should try
+to match it by name to its known extension types. If it does not recognize
+the extension type, then it should read it as the underlying physical type
+and should not try to interpret the column's statistics. It may, however,
+preserve the extension type information when transmitting the data to other
+systems, or for round-tripping purposes.
+
+### Examples
+
+The fictional ParquetNet community defines an IPv6 extension type
+with the following characteristics:
+
+1. Name: `parquetnet.ipv6`
+2. Parameters: none, the serialization is always empty
+3. Node type: only leaf
+4. Physical type: only FIXED_LEN_BYTE_ARRAY(16)
+5. Sort order: binary lexicographic order (the IP addresses use big-endian encoding)
+
+The fictional ParquetScience community defines a double-precision fixed-shape
+tensor type with the following characteristics:
+
+1. Name: `parquetsci.f64tensor`
+2. Parameters: the number of dimensions `ndim` (an integer), and the shape of the
+   tensor elements (a list of `ndim` integers). It is serialized as a JSON
+   object, for example: `{"ndim": 3, "shape": [4, 5, 6]}`
+3. Node type: only leaf
+4. Physical type: only FIXED_LEN_BYTE_ARRAY(nbytes) where `nbytes` is 8 times
+   the product of the shape's dimensions
+5. Sort order: unordered
+
 ## UNKNOWN (always null)
 
 Sometimes, when discovering the schema of existing data, values are always null
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 83457fe29..cd5acd519 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -288,6 +288,27 @@ struct Statistics {
    8: optional bool is_min_value_exact;
 }
 
+/**
+ * An extension type description
+ *
+ * Extension types allow for third-party semantics not provided by the core
+ * Parquet type system.
+ *
+ * `name` is a dotted name reliably identifying the extension type.
+ * Names beginning with "parquet." are reserved for standardization within
+ * the Parquet project.
+ *
+ * If the extension type is parametric, then `serialization` is an encoding
+ * of the extension type's parameters. It is recommended (but not required)
+ * that the parameters are serialized as a JSON object in UTF-8 encoding.
+ *
+ * If the extension type is not parametric, then `serialization` is empty.
+ */
+struct ExtensionTypeDescription {
+  1: required string name
+  2: optional binary serialization
+}
+
 /** Empty structs to use as logical type annotations */
 struct StringType {}  // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct UUIDType {}    // allowed for FIXED[16], must encoded raw UUID bytes
@@ -380,6 +401,21 @@ struct JsonType {
 struct BsonType {
 }
 
+/**
+ * Extension type annotation
+ *
+ * `type_index` is an index into `FileMetaData.extension_types`. This
+ * indirection allows for efficient representation of schemas with many
+ * columns of a given extension type.
+ *
+ * Each extension type specification will define the set of allowed physical
+ * types (for example, a hypothetical IPv6 extension type would require
+ * FIXED_LEN_BYTE_ARRAY(16)).
+ */
+struct ExtensionType {
+  1: required i32 type_index
+}
+
 /**
  * LogicalType annotations to replace ConvertedType.
  *
@@ -410,6 +446,7 @@ union LogicalType {
   13: BsonType BSON           // use ConvertedType BSON
   14: UUIDType UUID           // no compatible ConvertedType
   15: Float16Type FLOAT16     // no compatible ConvertedType
+  16: ExtensionType EXTENSION // no compatible ConvertedType
 }
 
 /**
@@ -956,7 +993,6 @@ struct TypeDefinedOrder {}
  * for this column should be ignored.
  */
 union ColumnOrder {
-
   /**
    * The sort orders for logical types are:
    *   UTF8 - unsigned byte-wise comparison
@@ -980,6 +1016,7 @@ union ColumnOrder {
    *   ENUM - unsigned byte-wise comparison
    *   LIST - undefined
    *   MAP - undefined
+   *   EXTENSION - extension type-specific
    *
    * In the absence of logical types, the sort order is determined by the physical type:
    *   BOOLEAN - false, true
@@ -1211,6 +1248,13 @@ struct FileMetaData {
    * Used only in encrypted files with plaintext footer.
    */
   9: optional binary footer_signing_key_metadata
+
+  /**
+   * A list of all extension types used in the Parquet schema, if any.
+   * The entries in this list are referenced through the `ExtensionType.type_index`
+   * of each ExtensionType field.
+   */
+  10: optional list<ExtensionTypeDescription> extension_types
 }
 
 /** Crypto metadata for files with encrypted footer **/
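
The reader behaviour described under "Reading extension types" could be sketched roughly as follows. This is only an illustration and is not part of the patch: the Python `ExtensionTypeDescription` dataclass is a hypothetical stand-in for the Thrift struct, and `REGISTRY`, `resolve`, `resolve_ipv6` and `resolve_f64tensor` are invented names. It assumes the two fictional types from the examples and the recommended JSON-in-UTF-8 parameter serialization.

```python
import json
from dataclasses import dataclass
from math import prod
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ExtensionTypeDescription:
    """Hypothetical Python stand-in for the Thrift struct of the same name."""
    name: str
    serialization: bytes = b""


def resolve_ipv6(desc: ExtensionTypeDescription, type_length: int) -> dict:
    # parquetnet.ipv6 takes no parameters and requires FIXED_LEN_BYTE_ARRAY(16).
    if desc.serialization:
        raise ValueError("parquetnet.ipv6 takes no parameters")
    if type_length != 16:
        raise ValueError("parquetnet.ipv6 requires FIXED_LEN_BYTE_ARRAY(16)")
    return {"kind": "ipv6"}


def resolve_f64tensor(desc: ExtensionTypeDescription, type_length: int) -> dict:
    # Parameters are assumed to be a UTF-8 encoded JSON object.
    params = json.loads(desc.serialization.decode("utf-8"))
    shape = params["shape"]
    if len(shape) != params["ndim"]:
        raise ValueError("shape length does not match ndim")
    # Each element is an 8-byte double, so nbytes = 8 * product of dimensions.
    if type_length != 8 * prod(shape):
        raise ValueError("physical type length does not match shape")
    return {"kind": "f64tensor", "shape": tuple(shape)}


# Extension types this (hypothetical) reader knows about, keyed by name.
REGISTRY: Dict[str, Callable[[ExtensionTypeDescription, int], dict]] = {
    "parquetnet.ipv6": resolve_ipv6,
    "parquetsci.f64tensor": resolve_f64tensor,
}


def resolve(extension_types: List[ExtensionTypeDescription],
            type_index: int, type_length: int) -> dict:
    """Follow `type_index` into the file-level list, then match by name."""
    desc = extension_types[type_index]
    handler = REGISTRY.get(desc.name)
    if handler is None:
        # Unrecognized: read as the physical type, ignore statistics,
        # and keep the description so it can be round-tripped.
        return {"kind": "physical", "use_statistics": False, "preserved": desc}
    return handler(desc, type_length)


file_extension_types = [
    ExtensionTypeDescription("parquetnet.ipv6"),
    ExtensionTypeDescription("parquetsci.f64tensor",
                             b'{"ndim": 3, "shape": [4, 5, 6]}'),
    ExtensionTypeDescription("acme.unknown"),
]
ipv6 = resolve(file_extension_types, 0, 16)
tensor = resolve(file_extension_types, 1, 8 * 4 * 5 * 6)
unknown = resolve(file_extension_types, 2, 4)
```

Keeping the descriptions in one file-level list and referring to them by index means a schema with many columns of the same extension type stores the name and parameters only once, which is the motivation given for `type_index` in the Thrift comment.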