PARQUET-2471: Add GEOMETRY and GEOGRAPHY logical types #240

Open
wants to merge 40 commits into
base: master
5c9e110
WIP: Add geometry logical type
wgtmac May 10, 2024
5ef28cd
address various comments
wgtmac May 25, 2024
ecd8cc2
add file level geo stats
wgtmac May 27, 2024
d81dacb
address feedback:
wgtmac May 31, 2024
80f4051
change naming and remove controversial items
wgtmac Jun 13, 2024
0db6d9f
address feedback
wgtmac Jun 16, 2024
e817af4
fix typo
wgtmac Jun 16, 2024
f78f7bd
use WKB type code
wgtmac Jun 19, 2024
1aaaca8
Update covering and geometry type protocol based on comments (#2)
zhangfengcdt Aug 7, 2024
ee5b2df
Add the new suggestion according to the meeting with Snowflake (#3)
jiayuasu Aug 15, 2024
19cc081
change metadata to string type and rewording WKB description
wgtmac Aug 20, 2024
16c5868
add example for crs
wgtmac Aug 21, 2024
56a65de
reword crs
wgtmac Aug 21, 2024
f28b282
clarify WKB
wgtmac Aug 22, 2024
5127702
clarify coverings
wgtmac Aug 24, 2024
298ab64
Update the suggestion for bbox stats (#4)
jiayuasu Sep 11, 2024
41c6394
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
d86abe4
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
c7a4f4c
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
f20f685
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
dbf9d54
address feedback about edges and wkb
wgtmac Sep 20, 2024
b4296aa
add geoparquet column metadata back
wgtmac Sep 27, 2024
9bcea6e
Update the spec according to the new feedback (#5)
jiayuasu Oct 4, 2024
99f0403
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
dbb78cf
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
25df0ff
add description to LogicalTypes.md
wgtmac Oct 13, 2024
d349727
add explanation for Z & M values
wgtmac Oct 13, 2024
9ea6559
move geo stats to ColumnMetaData
wgtmac Oct 16, 2024
011de45
Update src/main/thrift/parquet.thrift
wgtmac Oct 17, 2024
6425a3c
fix typo
wgtmac Oct 17, 2024
7d8ffa5
Merge branch 'master' into geo
wgtmac Nov 7, 2024
1502458
remove edges and simplify crs
wgtmac Nov 22, 2024
9f53c9e
Add geography type
wgtmac Dec 13, 2024
a4f79ca
remove wrong content
wgtmac Dec 13, 2024
9fac5e7
sync with the iceberg pr
wgtmac Jan 16, 2025
eb260e6
address comment
wgtmac Jan 16, 2025
bf84d7e
Update Geospatial.md
wgtmac Jan 18, 2025
3845edf
Update src/main/thrift/parquet.thrift
wgtmac Jan 18, 2025
41d9cd7
make algorithm optional
wgtmac Jan 24, 2025
082e31d
use single string field for crs
wgtmac Jan 25, 2025
144 changes: 144 additions & 0 deletions Geospatial.md
@@ -0,0 +1,144 @@
<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
-->

Geospatial Definitions
====

This document contains the specification of geospatial types and statistics.

# Background

The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
Well-Known Binary (WKB) serializations (ISO supporting XY, XYZ, XYM, XYZM) are
defined by [OpenGIS Implementation Specification for Geographic information –
Simple feature access – Part 1: Common architecture][sfa-part1], from [OGC
(Open Geospatial Consortium)][ogc].

The version of the OGC standard first used here is 1.2.1, but future versions
may also be used if the WKB representation remains wire-compatible.

[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
[ogc]: https://www.ogc.org/standard/sfa/

## Coordinate Reference System

A Coordinate Reference System (CRS) defines how coordinates map to locations on
Earth.

The default CRS `OGC:CRS84` means that the objects must be stored in longitude,
latitude based on the WGS84 datum.

A custom CRS can be specified by a string value. It is recommended to use an
identifier of the CRS, such as a [Spatial reference identifier][srid] or [PROJJSON][projjson].

For geographic CRS, longitudes are bound by [-180, 180] and latitudes are bound
by [-90, 90].
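
As an illustration only, a reader might resolve the CRS string along the following lines. The helper name `describe_crs` and the rule "a string that parses as a JSON object is treated as PROJJSON" are assumptions made for this sketch; the text above does not prescribe how to tell an identifier from PROJJSON.

```python
import json
from typing import Optional, Tuple

DEFAULT_CRS = "OGC:CRS84"  # default named in the text above


def describe_crs(crs: Optional[str]) -> Tuple[str, str]:
    """Return (kind, value) for a CRS string. Hypothetical helper, heuristic only."""
    if crs is None:
        # Unset CRS falls back to the documented default.
        return ("default", DEFAULT_CRS)
    text = crs.strip()
    if text.startswith("{"):
        try:
            doc = json.loads(text)
        except json.JSONDecodeError:
            # Not valid JSON, so treat it as an opaque identifier.
            return ("identifier", crs)
        if isinstance(doc, dict):
            # Assumption: a JSON object is taken to be a PROJJSON document.
            return ("projjson", crs)
    return ("identifier", crs)
```

A production reader would likely hand the string to a CRS library rather than rely on this kind of guess.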

[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier
[projjson]: https://proj.org/en/stable/specifications/projjson.html

## Edge Interpolation Algorithm

The edge interpolation algorithm determines how the edge between two vertices is interpolated. It is one of the following values:

* `spherical`: edges are interpolated as geodesics on a sphere.
* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
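
To make the difference between planar and `spherical` interpolation concrete, the sketch below computes the midpoint of an edge both ways. It assumes a perfect unit sphere and longitude/latitude input in degrees; the function names are illustrative and not part of the specification. The ellipsoidal algorithms (`vincenty`, `thomas`, `andoyer`, `karney`) are not covered.

```python
import math


def to_unit_vector(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to a 3D unit vector."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def spherical_midpoint(p, q):
    """Midpoint of the shorter great-circle edge from p to q, as (lon, lat) degrees."""
    a, b = to_unit_vector(*p), to_unit_vector(*q)
    s = [a[i] + b[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in s))
    if norm < 1e-12:
        raise ValueError("midpoint is undefined for antipodal points")
    x, y, z = (c / norm for c in s)
    z = max(-1.0, min(1.0, z))  # guard asin against rounding error
    return (math.degrees(math.atan2(y, x)), math.degrees(math.asin(z)))


def planar_midpoint(p, q):
    """Midpoint of a straight (linear/planar) edge in coordinate space."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
```

For an edge between two points on the same parallel away from the equator, the spherical midpoint lies at a higher latitude than the planar one, which is why the two logical types cannot be treated interchangeably.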

# Logical Types

Two geospatial logical type annotations are supported:
* `GEOMETRY`: Geometry features in the WKB format with linear/planar edge interpolation. See [Geometry](LogicalTypes.md#geometry)
* `GEOGRAPHY`: Geography features in the WKB format with an explicit (non-linear/non-planar) edge interpolation algorithm. See [Geography](LogicalTypes.md#geography)

# Statistics

`GeometryStatistics` is a struct specific to the `GEOMETRY` and `GEOGRAPHY` logical
types that stores statistics of a column chunk. It is an optional field in
`ColumnMetaData` and contains a [Bounding Box](#bounding-box) and [Geometry
Types](#geometry-types), both described in detail below.
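
Because these statistics are stored per column chunk, a reader that wants a file-level summary has to combine them. The following is a minimal sketch of one way to do that, under stated assumptions: bounding boxes are plain `(xmin, xmax, ymin, ymax)` tuples, Z and M are ignored for brevity, and boxes where a minimum exceeds the maximum are rejected rather than merged. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (xmin, xmax, ymin, ymax)


def merge_types(a: List[int], b: List[int]) -> List[int]:
    """Union of two geometry type lists; an empty list means 'not known'."""
    if not a or not b:
        # Unknown on either side makes the combined result unknown.
        return []
    return sorted(set(a) | set(b))


def merge_bbox(a: Optional[Box], b: Optional[Box]) -> Optional[Box]:
    """Union of two bounding boxes; None models an absent optional field."""
    if a is None or b is None:
        return None
    for box in (a, b):
        if box[0] > box[1] or box[2] > box[3]:
            # Assumption: this sketch does not handle min > max boxes.
            raise ValueError("boxes with min > max are not supported here")
    return (min(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), max(a[3], b[3]))
```

Note that an empty type list must not be treated as "no types present" when merging, since it signals that the types are unknown.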

## Bounding Box

A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
of each point. A geometry can optionally have Z and / or M values associated
with each point in the geometry.

The Z values introduce the third dimension coordinate. Usually they are used to
indicate the height, or elevation.

M values are an opportunity for a geometry to express a fourth dimension as a
coordinate value. These values can be used as a linear reference value (e.g.,
highway milepost value), a timestamp, or some other value as defined by the CRS.

The bounding box is defined by the Thrift struct below as a min/max pair of
coordinate values for each axis. X and Y values are always present. Z and M
are omitted for 2D geometries.

For the X and Y axes only, the minimum may be greater than the maximum
(`xmin > xmax` or `ymin > ymax`). When `xmin > xmax`, an object may match this
bounding box if it contains an X such that `x >= xmin` OR `x <= xmax`. Likewise,
when `ymin > ymax`, it may match if it contains a Y such that `y >= ymin` OR
`y <= ymax`. In geographic terminology, `xmin`, `xmax`, `ymin`, and `ymax` are
also known as `westernmost`, `easternmost`, `southernmost`, and `northernmost`,
respectively.
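
The matching rule above can be sketched for a single axis as follows. The function name `axis_may_match` is illustrative; the ordinary case (`lo <= hi`) is taken to be a closed interval, which is the usual reading of a min/max pair.

```python
def axis_may_match(value: float, lo: float, hi: float) -> bool:
    """Return True if value may fall inside the [lo, hi] range of one axis.

    When lo > hi, the rule from the text applies: value >= lo OR value <= hi.
    """
    if lo <= hi:
        return lo <= value <= hi
    return value >= lo or value <= hi
```

For example, with `xmin = 170` and `xmax = -170`, the X values 175 and -175 both match, while 0 does not.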

For `GEOGRAPHY` types, X and Y values are restricted to the canonical ranges of
[-180, 180] for X and [-90, 90] for Y.

```thrift
struct BoundingBox {
  1: required double xmin;
  2: required double xmax;
  3: required double ymin;
  4: required double ymax;
  5: optional double zmin;
  6: optional double zmax;
  7: optional double mmin;
  8: optional double mmax;
}
```
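
As a rough illustration of how a writer might populate this struct, the sketch below mirrors it as a Python dataclass and computes it from a list of coordinates. Everything here is an assumption for illustration: coordinates are `(x, y, z, m)` tuples with `None` for absent Z or M, Z and M ranges are taken over the coordinates that carry them, and the result is always a plain box with min <= max. NaN handling is not addressed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, Optional[float], Optional[float]]  # (x, y, z, m)


@dataclass
class BoundingBox:
    # Field names follow the Thrift struct above.
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    zmin: Optional[float] = None
    zmax: Optional[float] = None
    mmin: Optional[float] = None
    mmax: Optional[float] = None


def bounding_box(coords: Sequence[Coord]) -> BoundingBox:
    """Compute a non-wrapping bounding box over all coordinates."""
    if not coords:
        raise ValueError("cannot compute a bounding box of no coordinates")
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    zs = [c[2] for c in coords if c[2] is not None]
    ms = [c[3] for c in coords if c[3] is not None]
    return BoundingBox(
        min(xs), max(xs), min(ys), max(ys),
        min(zs) if zs else None, max(zs) if zs else None,
        min(ms) if ms else None, max(ms) if ms else None,
    )
```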

## Geometry Types

A list of geometry types from all geometries in the `GEOMETRY` or `GEOGRAPHY`
column, or an empty list if they are not known.

This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
The table below shows the most common geometry types and their codes:

| Type | XY | XYZ | XYM | XYZM |
| :----------------- | :--- | :--- | :--- | :--- |
| Point | 0001 | 1001 | 2001 | 3001 |
| LineString | 0002 | 1002 | 2002 | 3002 |
| Polygon | 0003 | 1003 | 2003 | 3003 |
| MultiPoint | 0004 | 1004 | 2004 | 3004 |
| MultiLineString | 0005 | 1005 | 2005 | 3005 |
| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
| GeometryCollection | 0007 | 1007 | 2007 | 3007 |

In addition, the following rules are applied:
- A list of multiple values indicates that multiple geometry types are present (e.g. `[0003, 0006]`).
- An empty list explicitly signals that the geometry types are not known.
- The geometry types in the list must be unique (e.g. `[0001, 0001]` is not valid).
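
The codes in the table can be read directly from the WKB header: one byte-order byte (0 for big-endian, 1 for little-endian) followed by a 32-bit unsigned type code. The sketch below assumes ISO WKB as described above (not EWKB, which encodes dimensions in flag bits), and the function names are illustrative.

```python
import struct

BASE_TYPES = {1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
              5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection"}
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def wkb_type_code(wkb: bytes) -> int:
    """Read the integer geometry type code from an ISO WKB header."""
    if len(wkb) < 5:
        raise ValueError("WKB value is too short to contain a header")
    if wkb[0] == 1:
        fmt = "<I"  # little-endian
    elif wkb[0] == 0:
        fmt = ">I"  # big-endian
    else:
        raise ValueError("unknown WKB byte order")
    (code,) = struct.unpack_from(fmt, wkb, 1)
    return code


def describe_type_code(code: int):
    """Split a code such as 3006 into ('MultiPolygon', 'XYZM')."""
    dim, base = divmod(code, 1000)
    return BASE_TYPES[base], DIMENSIONS[dim]


def collect_geometry_types(values):
    """Unique type codes for a column, satisfying the uniqueness rule above."""
    return sorted({wkb_type_code(v) for v in values})
```

Sorting the result is a choice made here for determinism; the rules above only require uniqueness.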

[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
33 changes: 33 additions & 0 deletions LogicalTypes.md
@@ -599,6 +599,39 @@ optional group variant_shredded (VARIANT) {
}
```

### GEOMETRY

`GEOMETRY` is used for geometry features in the Well-Known Binary (WKB) format
with linear/planar edge interpolation. It must annotate a `BYTE_ARRAY`
primitive type. See [Geospatial.md](Geospatial.md) for more detail.

The type has only one type parameter:
- `crs`: An optional string value for CRS. If unset, the CRS defaults to
`"OGC:CRS84"`, which means that the geometries must be stored in longitude,
latitude based on the WGS84 datum.

The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
statistics should be saved for this type and if such non-compliant statistics
are found during reading, they must be ignored.

### GEOGRAPHY

`GEOGRAPHY` is used for geography features in the WKB format with an explicit
(non-linear/non-planar) edge interpolation algorithm. It must annotate a
`BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail.

The type has two type parameters:
- `crs`: An optional string value for CRS. It must be a geographic CRS, where
longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90].
If unset, the CRS defaults to `"OGC:CRS84"`.
- `algorithm`: An optional enum value that describes the edge interpolation
algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`,
`KARNEY`. If unset, the algorithm defaults to `SPHERICAL`.

The sort order used for `GEOGRAPHY` is undefined. When writing data, no min/max
statistics should be saved for this type and if such non-compliant statistics
are found during reading, they must be ignored.
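
The defaulting rules for the two `GEOGRAPHY` parameters can be sketched as follows. The Python class mirrors the Thrift definitions in this change, but the `effective_*` method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EdgeInterpolationAlgorithm(Enum):
    # Values follow the Thrift enum in this change.
    SPHERICAL = 0
    VINCENTY = 1
    THOMAS = 2
    ANDOYER = 3
    KARNEY = 4


@dataclass
class GeographyType:
    crs: Optional[str] = None
    algorithm: Optional[EdgeInterpolationAlgorithm] = None

    def effective_crs(self) -> str:
        """An unset crs defaults to OGC:CRS84."""
        return self.crs if self.crs is not None else "OGC:CRS84"

    def effective_algorithm(self) -> EdgeInterpolationAlgorithm:
        """An unset algorithm defaults to SPHERICAL."""
        if self.algorithm is not None:
            return self.algorithm
        return EdgeInterpolationAlgorithm.SPHERICAL
```

Keeping the stored fields optional and resolving defaults on read preserves the distinction between "unset" and "explicitly set to the default" when metadata is round-tripped.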

## Nested Types

This section specifies how `LIST` and `MAP` can be used to encode nested types
85 changes: 85 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -237,6 +237,29 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}

/**
* Bounding box of geometries in the representation of min/max value pair of
* coordinates from each axis.
*/
struct BoundingBox {
1: required double xmin;
2: required double xmax;
3: required double ymin;
4: required double ymax;
5: optional double zmin;
6: optional double zmax;
7: optional double mmin;
8: optional double mmax;
}

/** Statistics specific to Geometry and Geography logical types */
struct GeometryStatistics {
/** A bounding box of geometries */
1: optional BoundingBox bbox;
/** Geometry type codes of all geometries, or an empty list if not known */
2: optional list<i32> geometry_types;
}

/**
* Statistics per row group and per page
* All fields are optional.
@@ -386,6 +409,61 @@ struct BsonType {
struct VariantType {
}

/** Coordinate reference system (CRS) encoding for Geometry and Geography logical types */
enum CRSEncoding {
SRID = 0;
PROJJSON = 1;
}

/** Edge interpolation algorithm for Geography logical type */
enum EdgeInterpolationAlgorithm {
SPHERICAL = 0;
VINCENTY = 1;
THOMAS = 2;
ANDOYER = 3;
KARNEY = 4;
}

/**
* Embedded Geometry logical type annotation
*
* Geometry features in the Well-Known Binary (WKB) format. Edge interpolation
* is always linear/planar.
*
* A custom CRS can be set by the crs field. If unset, it defaults to "OGC:CRS84",
* which means that the geometries must be stored in longitude, latitude based on
* the WGS84 datum.
*
* Allowed for physical type: BYTE_ARRAY.
*
* See Geospatial.md for details.
*/
struct GeometryType {
1: optional string crs;
}

/**
* Embedded Geography logical type annotation
*
* Geography features in the WKB format with an explicit (non-linear/non-planar)
* edge interpolation algorithm.
*
* A custom geographic CRS can be set by the crs field, where longitudes are
* bound by [-180, 180] and latitudes are bound by [-90, 90]. If unset, the CRS
* defaults to "OGC:CRS84".
*
* An optional algorithm can be set to correctly interpret the edge interpolation
* of the geometries. If unset, the algorithm defaults to SPHERICAL.
*
* Allowed for physical type: BYTE_ARRAY.
*
* See Geospatial.md for details.
*/
struct GeographyType {
1: optional string crs;
2: optional EdgeInterpolationAlgorithm algorithm;
}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -417,6 +495,8 @@ union LogicalType {
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: VariantType VARIANT // no compatible ConvertedType
17: GeometryType GEOMETRY // no compatible ConvertedType
18: GeographyType GEOGRAPHY // no compatible ConvertedType
}

/**
@@ -857,6 +937,9 @@ struct ColumnMetaData {
* filter pushdown.
*/
16: optional SizeStatistics size_statistics;

/** Optional statistics specific to Geometry and Geography logical types */
17: optional GeometryStatistics geometry_statistics;
}

struct EncryptionWithFooterKey {
@@ -988,6 +1071,8 @@ union ColumnOrder {
* LIST - undefined
* MAP - undefined
* VARIANT - undefined
* GEOMETRY - undefined
* GEOGRAPHY - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true