diff --git a/Geospatial.md b/Geospatial.md
new file mode 100644
index 00000000..3ecb821a
--- /dev/null
+++ b/Geospatial.md
@@ -0,0 +1,171 @@
+
+
+Geospatial Definitions
+====
+
+This document contains the specification of geospatial types and statistics.
+
+# Background
+
+The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
+Well-Known Binary (WKB) serializations (ISO supporting XY, XYZ, XYM, XYZM) are
+defined by [OpenGIS Implementation Specification for Geographic information –
+Simple feature access – Part 1: Common architecture][sfa-part1], from [OGC
+(Open Geospatial Consortium)][ogc].
+
+The version of the OGC standard first used here is 1.2.1, but future versions
+may also be used if the WKB representation remains wire-compatible.
+
+[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
+[ogc]: https://www.ogc.org/standard/sfa/
+
+## Well-Known Binary
+
+Well-Known Binary (WKB) representations of geometries.
+
+Apache Parquet follows the same definitions as GeoParquet for [WKB][geoparquet-wkb]
+and [coordinate axis order][coordinate-axis-order]:
+- Geometries should be encoded as ISO WKB supporting XY, XYZ, XYM, XYZM. Supported
+standard geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString,
+MultiPolygon, and GeometryCollection.
+- Coordinate axis order is always (x, y) where x is easting or longitude, and
+y is northing or latitude. This ordering explicitly overrides the axis order
+as specified in the CRS following the [GeoPackage specification][geopackage-spec].
+
+[geoparquet-wkb]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
+[coordinate-axis-order]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
+[geopackage-spec]: https://www.geopackage.org/spec130/#gpb_spec
+
+## Coordinate Reference System
+
+Coordinate Reference System (CRS) is a mapping of how coordinates refer to
+locations on Earth.
+
+Apache Parquet supports CRS customization by providing the following attributes:
+* `crs`: a CRS text representation. If unset, the CRS defaults to "OGC:CRS84".
+* `crs_encoding`: a standard encoding used to represent the CRS text. If unset,
+  `crs` can be an arbitrary string.
+
+For maximum interoperability of a custom CRS, it is recommended to provide
+the CRS text with a standard encoding. Supported CRS encodings are:
+* `SRID`: [Spatial reference identifier][srid], CRS text is the identifier itself.
+* `PROJJSON`: [PROJJSON][projjson], CRS text is the projjson string.
+
+For example, if a Geometry or Geography column uses the CRS "OGC:CRS84", a writer
+may write a PROJJSON representation of [OGC:CRS84][ogc-crs84] to the `crs` field
+and set the `crs_encoding` field to `PROJJSON`.
+
+[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier
+[projjson]: https://proj.org/en/stable/specifications/projjson.html
+[ogc-crs84]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
+
+## Edge Interpolation Algorithm
+
+The edge interpolation algorithm is used for interpreting edges of elements of
+a Geography column. It applies to all non-point geometry objects and is
+independent of the [Coordinate Reference System](#coordinate-reference-system).
+
+Supported values are:
+* `spherical`: edges are interpolated as geodesics on a sphere. The radius of the underlying sphere is the mean radius of the spheroid defined by the CRS, defined as (2 * major_axis_length + minor_axis_length) / 3.
+* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
+* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
+* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
+* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
+
+# Logical Types
+
+Apache Parquet supports the following geospatial logical type annotations:
+* `GEOMETRY`: Geometry features in the WKB format with linear/planar edge interpolation. See [Geometry logical type](LogicalTypes.md#geometry)
+* `GEOGRAPHY`: Geometry features in the WKB format with non-linear/non-planar edge interpolation. See [Geography logical type](LogicalTypes.md#geography)
+
+# Statistics
+
+`GeometryStatistics` is a struct specific to the `GEOMETRY` and `GEOGRAPHY` logical
+types that stores statistics of a column chunk. It is an optional field in the
+`ColumnMetaData` and contains [Bounding Box](#bounding-box) and [Geometry
+Types](#geometry-types).
+
+## Bounding Box
+
+A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
+of each point. A geometry can optionally have Z and/or M values associated
+with each point in the geometry.
+
+The Z values introduce the third dimension coordinate. They are usually used to
+indicate height or elevation.
+
+M values are an opportunity for a geometry to express a fourth dimension as a
+coordinate value. These values can be used as a linear reference value (e.g.,
+highway milepost value), a timestamp, or some other value as defined by the CRS.
+
+The bounding box is defined by the Thrift struct below as a pair of min/max
+coordinate values for each axis. Note that X and Y values are
+always present. Z and M are omitted for 2D geometries. The concepts of westmost
+and eastmost values are explicitly introduced for the Geography logical type to
+address cases involving antimeridian crossing, where xmin may be greater than
+xmax.
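
The antimeridian case described above can be illustrated with a small sketch. This is not part of the specification: `BBox2D`, `x_in_range`, and `bbox_may_contain` are hypothetical names, and the sketch assumes a reader wants to test whether a point could fall inside a Geography bounding box whose `xmin` (westmost) may exceed its `xmax` (eastmost).

```python
from dataclasses import dataclass


@dataclass
class BBox2D:
    """Hypothetical stand-in for the required X/Y fields of BoundingBox."""
    xmin: float  # westmost value for Geography
    xmax: float  # eastmost value for Geography
    ymin: float
    ymax: float


def x_in_range(x: float, xmin: float, xmax: float) -> bool:
    """Check x against [xmin, xmax], allowing wraparound when xmin > xmax."""
    if xmin <= xmax:
        return xmin <= x <= xmax
    # xmin > xmax: the box crosses the antimeridian, so it covers
    # everything east of xmin plus everything west of xmax.
    return x >= xmin or x <= xmax


def bbox_may_contain(bbox: BBox2D, x: float, y: float) -> bool:
    """Return True if the point (x, y) lies within the bounding box."""
    return x_in_range(x, bbox.xmin, bbox.xmax) and bbox.ymin <= y <= bbox.ymax


# A Geography box crossing the antimeridian: westmost 177, eastmost -178.
crossing = BBox2D(xmin=177.0, xmax=-178.0, ymin=-20.0, ymax=-15.0)
print(bbox_may_contain(crossing, 179.5, -17.0))  # True
print(bbox_may_contain(crossing, 0.0, -17.0))    # False
```

For the Geometry logical type the wraparound branch is never taken, because `xmin` and `xmax` are plain minimum and maximum values.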
+
+```thrift
+struct BoundingBox {
+  /** Min X value for Geometry logical type, westmost value for Geography logical type */
+  1: required double xmin;
+  /** Max X value for Geometry logical type, eastmost value for Geography logical type */
+  2: required double xmax;
+  /** Min Y value for Geometry logical type, southmost value for Geography logical type */
+  3: required double ymin;
+  /** Max Y value for Geometry logical type, northmost value for Geography logical type */
+  4: required double ymax;
+  /** Min Z value if the axis exists */
+  5: optional double zmin;
+  /** Max Z value if the axis exists */
+  6: optional double zmax;
+  /** Min M value if the axis exists */
+  7: optional double mmin;
+  /** Max M value if the axis exists */
+  8: optional double mmax;
+}
+```
+
+## Geometry Types
+
+A list of geometry types from all geometries in the `GEOMETRY` or `GEOGRAPHY`
+column, or an empty list if they are not known.
+
+This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
+values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
+The table below shows the most common geometry types and their codes:
+
+| Type               | XY   | XYZ  | XYM  | XYZM |
+| :----------------- | :--- | :--- | :--- | :--: |
+| Point              | 0001 | 1001 | 2001 | 3001 |
+| LineString         | 0002 | 1002 | 2002 | 3002 |
+| Polygon            | 0003 | 1003 | 2003 | 3003 |
+| MultiPoint         | 0004 | 1004 | 2004 | 3004 |
+| MultiLineString    | 0005 | 1005 | 2005 | 3005 |
+| MultiPolygon       | 0006 | 1006 | 2006 | 3006 |
+| GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+
+In addition, the following rules are applied:
+- A list of multiple values indicates that multiple geometry types are present (e.g. `[0003, 0006]`).
+- An empty array explicitly signals that the geometry types are not known.
+- The geometry types in the list must be unique (e.g. `[0001, 0001]` is not valid).
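
As an illustration of the table and rules above, the codes can be split into a dimension part (thousands digit) and a base geometry type (remainder). The following is a minimal sketch, not part of the specification; `decode_type_code` and `is_valid_geometry_types` are hypothetical helper names, and the codes are written as plain integers (`3` rather than the zero-padded `0003` used in the table).

```python
from typing import List, Tuple

# Base geometry types and dimension prefixes, taken from the table above.
GEOMETRY_NAMES = {
    1: "Point",
    2: "LineString",
    3: "Polygon",
    4: "MultiPoint",
    5: "MultiLineString",
    6: "MultiPolygon",
    7: "GeometryCollection",
}
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def decode_type_code(code: int) -> Tuple[str, str]:
    """Split an ISO WKB integer code into (geometry type, dimension)."""
    dim, base = divmod(code, 1000)
    if base not in GEOMETRY_NAMES or dim not in DIMENSIONS:
        raise ValueError(f"unsupported geometry type code: {code}")
    return GEOMETRY_NAMES[base], DIMENSIONS[dim]


def is_valid_geometry_types(codes: List[int]) -> bool:
    """Apply the uniqueness rule; an empty list means 'not known' and is valid."""
    return len(set(codes)) == len(codes)


print(decode_type_code(3))     # ('Polygon', 'XY')
print(decode_type_code(3006))  # ('MultiPolygon', 'XYZM')
print(is_valid_geometry_types([3, 6]))  # True
print(is_valid_geometry_types([1, 1]))  # False
```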
+
+[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
+[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
diff --git a/LogicalTypes.md b/LogicalTypes.md
index 3aa5ceb9..5c3ddcf4 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -599,6 +599,51 @@ optional group variant_shredded (VARIANT) {
 }
 ```
 
+### GEOMETRY
+
+`GEOMETRY` is used for geometry features in the Well-Known Binary (WKB) format
+with linear/planar edge interpolation. See [Geospatial.md](Geospatial.md) for
+more detail.
+
+The type has two type parameters:
+- `crs`: An optional string value for Coordinate Reference System (CRS), which
+  is a mapping of how coordinates refer to locations on Earth. If unset, the CRS
+  defaults to "OGC:CRS84", which means that the geometries must be stored in
+  longitude, latitude based on the WGS84 datum.
+- `crs_encoding`: An optional enum value that describes the encoding used by the
+  `crs` field. Supported values are: `SRID`, `PROJJSON`. If unset, `crs` can be
+  an arbitrary string.
+
+The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
+statistics should be saved for this type and if such non-compliant statistics
+are found during reading, they must be ignored.
+
+[`GeometryStatistics`](Geospatial.md#statistics) is introduced to store statistics
+for the `GEOMETRY` type.
+
+### GEOGRAPHY
+
+`GEOGRAPHY` is used for geography features in the WKB format with non-linear/non-planar
+edge interpolation.
+
+The type has three type parameters:
+- `crs`: An optional string value for CRS, similar to the `GEOMETRY` type. It must
+  be a geographic CRS, where longitudes are bounded by [-180, 180] and latitudes
+  are bounded by [-90, 90].
+- `crs_encoding`: An optional enum value, similar to the `GEOMETRY` type.
+- `algorithm`: A required enum value that describes the edge interpolation
+  algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`,
+  `KARNEY`. In order to correctly interpret the edge interpolation of the geometries,
+  writer implementations should always set it and reader implementations should
+  fail for unknown values.
+
+The sort order used for `GEOGRAPHY` is undefined. When writing data, no min/max
+statistics should be saved for this type and if such non-compliant statistics
+are found during reading, they must be ignored.
+
+[`GeometryStatistics`](Geospatial.md#statistics) is introduced to store statistics
+for the `GEOGRAPHY` type.
+
 ## Nested Types
 
 This section specifies how `LIST` and `MAP` can be used to encode nested types
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 5d4431d9..8cc80fb8 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -237,6 +237,37 @@ struct SizeStatistics {
    3: optional list<i64> definition_level_histogram;
 }
 
+/**
+ * Bounding box of geometries, represented as a min/max pair of coordinate
+ * values for each axis.
+ */
+struct BoundingBox {
+  /** Min X value for Geometry logical type, westmost value for Geography logical type */
+  1: required double xmin;
+  /** Max X value for Geometry logical type, eastmost value for Geography logical type */
+  2: required double xmax;
+  /** Min Y value for Geometry logical type, southmost value for Geography logical type */
+  3: required double ymin;
+  /** Max Y value for Geometry logical type, northmost value for Geography logical type */
+  4: required double ymax;
+  /** Min Z value if the axis exists */
+  5: optional double zmin;
+  /** Max Z value if the axis exists */
+  6: optional double zmax;
+  /** Min M value if the axis exists */
+  7: optional double mmin;
+  /** Max M value if the axis exists */
+  8: optional double mmax;
+}
+
+/** Statistics specific to Geometry and Geography logical types */
+struct GeometryStatistics {
+  /** A bounding box of geometries */
+  1: optional BoundingBox bbox;
+  /** Geometry type codes of all geometries, or an empty list if not known */
+  2: optional list<i32> geometry_types;
+}
+
 /**
  * Statistics per row group and per page
  * All fields are optional.
@@ -386,6 +417,64 @@ struct BsonType {
 struct VariantType {
 }
 
+/** Coordinate reference system (CRS) encoding for Geometry and Geography logical types */
+enum CRSEncoding {
+  SRID = 0;
+  PROJJSON = 1;
+}
+
+/** Edge interpolation algorithm for Geography logical type */
+enum EdgeInterpolationAlgorithm {
+  SPHERICAL = 0;
+  VINCENTY = 1;
+  THOMAS = 2;
+  ANDOYER = 3;
+  KARNEY = 4;
+}
+
+/**
+ * Embedded Geometry logical type annotation
+ *
+ * Geometry features in the Well-Known Binary (WKB) format with linear/planar
+ * edge interpolation.
+ *
+ * A custom CRS can be set in the crs field. If unset, the CRS defaults to
+ * "OGC:CRS84", which means that the geometries must be stored in longitude,
+ * latitude based on the WGS84 datum.
+ *
+ * crs_encoding is an auxiliary field to help decode the crs text. If unset, the
+ * crs field can be arbitrary text.
+ *
+ * Allowed for physical type: BYTE_ARRAY.
+ */
+struct GeometryType {
+  1: optional string crs;
+  2: optional CRSEncoding crs_encoding;
+}
+
+/**
+ * Embedded Geography logical type annotation
+ *
+ * Geometry features in the WKB format with non-linear/non-planar edge
+ * interpolation.
+ *
+ * Similar to the Geometry logical type, a custom CRS can be set in the crs and
+ * crs_encoding fields. However, the Geography logical type must use a geographic
+ * CRS, where longitudes are bounded by [-180, 180] and latitudes are bounded by
+ * [-90, 90].
+ *
+ * algorithm is required. In order to correctly interpret the edge interpolation
+ * of the geometries, writer implementations should always set it and reader
+ * implementations should fail for unknown values.
+ *
+ * Allowed for physical type: BYTE_ARRAY.
+ */
+struct GeographyType {
+  1: optional string crs;
+  2: optional CRSEncoding crs_encoding;
+  3: required EdgeInterpolationAlgorithm algorithm;
+}
+
 /**
  * LogicalType annotations to replace ConvertedType.
  *
@@ -417,6 +506,7 @@ union LogicalType {
   14: UUIDType UUID // no compatible ConvertedType
   15: Float16Type FLOAT16 // no compatible ConvertedType
   16: VariantType VARIANT // no compatible ConvertedType
+  17: GeometryType GEOMETRY // no compatible ConvertedType
 }
 
 /**
@@ -857,6 +947,9 @@ struct ColumnMetaData {
    * filter pushdown.
    */
   16: optional SizeStatistics size_statistics;
+
+  /** Optional statistics specific to Geometry and Geography logical types */
+  17: optional GeometryStatistics geometry_statistics;
 }
 
 struct EncryptionWithFooterKey {
@@ -988,6 +1081,8 @@ union ColumnOrder {
    *   LIST - undefined
    *   MAP - undefined
    *   VARIANT - undefined
+   *   GEOMETRY - undefined
+   *   GEOGRAPHY - undefined
    *
    * In the absence of logical types, the sort order is determined by the physical type:
    *   BOOLEAN - false, true