PARQUET-2471: Add geometry logical type #240

Status: Open. Wants to merge 34 commits into base: master.

Commits (changes shown from 28 of 34 commits):
5c9e110
WIP: Add geometry logical type
wgtmac May 10, 2024
5ef28cd
address various comments
wgtmac May 25, 2024
ecd8cc2
add file level geo stats
wgtmac May 27, 2024
d81dacb
address feedback:
wgtmac May 31, 2024
80f4051
change naming and remove controversial items
wgtmac Jun 13, 2024
0db6d9f
address feedback
wgtmac Jun 16, 2024
e817af4
fix typo
wgtmac Jun 16, 2024
f78f7bd
use WKB type code
wgtmac Jun 19, 2024
1aaaca8
Update covering and geometry type protocol based on comments (#2)
zhangfengcdt Aug 7, 2024
ee5b2df
Add the new suggestion according to the meeting with Snowflake (#3)
jiayuasu Aug 15, 2024
19cc081
change metadata to string type and rewording WKB description
wgtmac Aug 20, 2024
16c5868
add example for crs
wgtmac Aug 21, 2024
56a65de
reword crs
wgtmac Aug 21, 2024
f28b282
clarify WKB
wgtmac Aug 22, 2024
5127702
clarify coverings
wgtmac Aug 24, 2024
298ab64
Update the suggestion for bbox stats (#4)
jiayuasu Sep 11, 2024
41c6394
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
d86abe4
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
c7a4f4c
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
f20f685
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
dbf9d54
address feedback about edges and wkb
wgtmac Sep 20, 2024
b4296aa
add geoparquet column metadata back
wgtmac Sep 27, 2024
9bcea6e
Update the spec according to the new feedback (#5)
jiayuasu Oct 4, 2024
99f0403
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
dbb78cf
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
25df0ff
add description to LogicalTypes.md
wgtmac Oct 13, 2024
d349727
add explanation for Z & M values
wgtmac Oct 13, 2024
9ea6559
move geo stats to ColumnMetaData
wgtmac Oct 16, 2024
011de45
Update src/main/thrift/parquet.thrift
wgtmac Oct 17, 2024
6425a3c
fix typo
wgtmac Oct 17, 2024
7d8ffa5
Merge branch 'master' into geo
wgtmac Nov 7, 2024
1502458
remove edges and simplify crs
wgtmac Nov 22, 2024
9f53c9e
Add geography type
wgtmac Dec 13, 2024
a4f79ca
remove wrong content
wgtmac Dec 13, 2024
182 changes: 182 additions & 0 deletions LogicalTypes.md
@@ -767,6 +767,188 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

## Geospatial Types

### GEOMETRY

`GEOMETRY` is used for geometry features from [OGC – Simple feature access][simple-feature-access].
See [Geospatial Notes](#geospatial-notes).

The type has three type parameters:
- `encoding`: A required enum value for annotated physical type and encoding
for the `GEOMETRY` type. See [Geometry Encoding](#geometry-encoding).
- `edges`: A required enum value for interpretation for edges of elements of the
`GEOMETRY` type, i.e. whether the interpolation between points along
an edge represents a straight cartesian line or the shortest line on
the sphere. See [Edges](#edges).
- `crs`: An optional string value for CRS (coordinate reference system), which
is a mapping of how coordinates refer to precise locations on earth.
See [Coordinate Reference System](#coordinate-reference-system).

The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
statistics should be saved for this type and if such non-compliant statistics
are found during reading, they must be ignored. Instead, [GeometryStatistics](#geometry-statistics)
is introduced for `GEOMETRY` type.

#### Geometry Encoding

Physical type and encoding for the `GEOMETRY` type. Supported values:
- `WKB`: `GEOMETRY` type with `WKB` encoding can only be used to annotate the
`BYTE_ARRAY` primitive type. See [WKB](#well-known-binary-wkb).

Note that geometry encoding is required for `GEOMETRY` type. In order to correctly
interpret geometry data, writer implementations SHOULD always set this field, and
reader implementations SHOULD fail for an unknown geometry encoding value.
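
A reader-side check for the rule above might look like the following minimal sketch. The enum value (`WKB = 0`) mirrors the `GeometryEncoding` Thrift enum proposed in this PR; the function name `check_geometry_encoding` and the use of a plain string for the physical type are hypothetical, chosen only for illustration.

```python
from enum import IntEnum


class GeometryEncoding(IntEnum):
    """Mirrors the GeometryEncoding Thrift enum proposed in this PR."""
    WKB = 0


# Physical types each encoding may annotate, per the text above.
ALLOWED_PHYSICAL_TYPES = {GeometryEncoding.WKB: {"BYTE_ARRAY"}}


def check_geometry_encoding(raw_encoding: int, physical_type: str) -> GeometryEncoding:
    """Hypothetical reader check: fail on unknown encodings or mismatched physical types."""
    try:
        encoding = GeometryEncoding(raw_encoding)
    except ValueError:
        raise ValueError(f"unknown geometry encoding: {raw_encoding}") from None
    if physical_type not in ALLOWED_PHYSICAL_TYPES[encoding]:
        raise ValueError(f"{encoding.name} cannot annotate {physical_type}")
    return encoding
```

Failing early keeps a reader from handing bytes in an encoding it does not understand to a WKB parser.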

##### Well-known binary (WKB)

Well-known binary (WKB) representations of geometries, see [Geospatial Notes](#geospatial-notes).

To be clear, we follow the same definitions as GeoParquet for [WKB][geoparquet-wkb]
and [coordinate axis order][coordinate-axis-order]:
- Geometries SHOULD be encoded as ISO WKB supporting XY, XYZ, XYM, XYZM. Supported
standard geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString,
MultiPolygon, and GeometryCollection.
- Coordinate axis order is always (x, y) where x is easting or longitude, and
y is northing or latitude. This ordering explicitly overrides the axis order
as specified in the CRS following the [GeoPackage specification][geopackage-spec].

This is the preferred encoding for maximum portability.
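
As an illustration of the byte layout, the sketch below encodes and decodes a single Point as little-endian ISO WKB using only the Python standard library. The helper names `encode_point_wkb` and `decode_point_wkb` are hypothetical, and the sketch covers Point only; a real writer would normally rely on an existing geometry library.

```python
import struct

# ISO WKB type codes for Point, keyed by number of coordinates.
# XYM (2001) is left out because it cannot be told apart from XYZ by length alone.
POINT_CODES = {2: 1, 3: 1001, 4: 3001}


def encode_point_wkb(*coords: float) -> bytes:
    """Encode a Point as little-endian ISO WKB: byte order flag, type code, coordinates."""
    code = POINT_CODES[len(coords)]
    # '<' little-endian, 'B' byte-order flag (1 = little-endian),
    # 'I' uint32 type code, 'd' one double per coordinate.
    return struct.pack(f"<BI{len(coords)}d", 1, code, *coords)


def decode_point_wkb(data: bytes) -> tuple:
    """Decode a Point produced by encode_point_wkb; returns (type_code, coords)."""
    order = "<" if data[0] == 1 else ">"
    (code,) = struct.unpack_from(f"{order}I", data, 1)
    count = (len(data) - 5) // 8  # 5 header bytes, then 8 bytes per double
    return code, struct.unpack_from(f"{order}{count}d", data, 5)
```

Following the axis order rule above, `encode_point_wkb(30.0, 21.0)` stores x (longitude) first and y (latitude) second.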

[geoparquet-wkb]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
[coordinate-axis-order]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
[geopackage-spec]: https://www.geopackage.org/spec130/#gpb_spec

#### Edges

Interpretation for edges of elements of `GEOMETRY` type. In other words, it
specifies how a point between two vertices should be interpolated in its XY
dimensions. Supported values and corresponding interpolation approaches are:
- `PLANAR`: a Cartesian line connecting the two vertices.
- `SPHERICAL`: a shortest spherical arc between the longitude and latitude
represented by the two vertices.

This value applies to all non-point geometry objects and is independent of the
[Coordinate Reference System](#coordinate-reference-system).


(I work with the Salesforce Data Cloud team and am evaluating geospatial support in Iceberg.)
I am new to the geospatial world and am wondering what it means for edges to be independent of the underlying CRS. Can the edges be planar while the CRS is based on elliptic geometry?


> Can the edges be planar while the CRS is based on elliptic geometry?

In principle, no. First, talking about "planar edges" or "spherical edges" makes no sense and was a confusion of terms in the initial draft of this specification (the group reached an agreement to fix that in recent talks, I hope it will be done before release). An edge can be a straight line, a curve, a geodesic, etc., but cannot be a plane or a sphere (because of wrong number of dimensions).

What the initial draft intended to say with "planar edges" (sic) is "edges computed as if they were in a planar (two-dimensional Cartesian) coordinate system" (the thing that is planar is the coordinate system, not the edges). This is not really correct for a geographic CRS, so you are right to say that they are not really independent. However, while it would be more exact to say that lines on a geographic CRS are geodesics, loxodromes, etc., it often happens that software ignores that physical reality and just performs linear interpolation of latitude and longitude values. The line on the ellipsoid surface obtained that way has no interesting properties; it is just easy to compute. We do not recommend doing that, but the use of the word "planar" in this context was an acknowledgement that it happens in practice and an attempt to describe that.


Thank you for the response. I do not understand what this parameter is used for in parquet. If it is the engine's property to treat the edges, how is this value helping? The engine capable of interpreting edges as geodesics should do so if the CRS reference indicates that the underlying geometry column belongs to an ellipsoid datum. Is this edge property forcing the engine to treat the values in a planar coordinate system?

In other words, is there something intrinsic to the data stored in the parquet file itself where edge parameter makes a difference?


@redblackcoder

The Geo and Iceberg community are discussing the best way to describe this field. It is very likely that we will want to rename edges property to something else because this is not what we want to describe initially. We will post updates in a few days.

@mentin mentin Nov 7, 2024

> The engine capable of interpreting edges as geodesics should do so if the CRS reference indicates that the underlying geometry column belongs to an ellipsoid datum.

Consider the most common case, SRID 4326. It is a geographic coordinate system (GEOGCS) rather than a projected one.
https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/

So the linestring from A to B should follow the geodesic line. But most systems treat 4326 as a planar map. E.g. with the Geometry type in PostGIS or MS SQL Server, they treat it as a projected coordinate system, and the linestrings follow straight lines on a flat surface. If you use the latest MySQL or the Geography type in PostGIS or MS SQL Server, the linestrings in 4326 follow geodesic lines on a sphere. So there is ambiguity about what exactly a linestring or polygon in 4326 describes. Is `point(30 21)` inside `polygon((10 10, 50 10, 50 20, 10 20, 10 10))`?

With geometry, in PostGIS, returns false:

select st_intersects(
  st_geomfromtext('polygon((10 10, 50 10, 50 20, 10 20, 10 10))', 4326), 
  st_geomfromtext('point(30 21)', 4326));

Same thing with geography (4326 is presumed), returns true:

 select st_intersects(
  st_geographyfromtext('srid=4326;polygon((10 10, 50 10, 50 20, 10 20, 10 10))'),
  st_geographyfromtext('srid=4326;point(30 21)'));

Unfortunately, there is no accepted way to describe the difference between geometry and geography in WKB format. You can encounter SRID=4326 with both interpretations. The edge attribute allows describing the difference between geometry and geography, and tells user how to interpret the data in a way consistent with the system that produced it.
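
To make the difference concrete, here is a minimal sketch (not PostGIS's actual algorithm) that compares the midpoint of the polygon's northern edge, from (10, 20) to (50, 20), under the two interpretations. It assumes a unit sphere rather than an ellipsoid, and the function names are hypothetical.

```python
import math


def planar_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of a straight line drawn in the (lon, lat) plane."""
    return (lon1 + lon2) / 2, (lat1 + lat2) / 2


def spherical_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of the shortest great-circle arc on a unit sphere (degrees in and out)."""
    def to_xyz(lon, lat):
        lon, lat = math.radians(lon), math.radians(lat)
        return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

    a, b = to_xyz(lon1, lat1), to_xyz(lon2, lat2)
    # The sum of the two unit vectors points at the arc's midpoint
    # (undefined for antipodal points, which this sketch does not handle).
    x, y, z = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    return math.degrees(math.atan2(y, x)), math.degrees(math.atan2(z, math.hypot(x, y)))


print(planar_midpoint(10, 20, 50, 20))     # latitude stays at 20
print(spherical_midpoint(10, 20, 50, 20))  # latitude rises to roughly 21.2
```

With straight lines the northern edge stays at latitude 20, so `point(30 21)` lies outside. On the sphere the same edge bulges north past latitude 21 at longitude 30, so the point lies inside, which is consistent with the two query results quoted above.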


Because most systems currently assume planar edges and do not support spherical
edges, `PLANAR` should be used as the default value.

Note that edges is required for `GEOMETRY` type. In order to correctly
interpret geometry data, writer implementations SHOULD always set this field,
and reader implementations SHOULD fail for an unknown edges value.

#### Coordinate Reference System

CRS (coordinate reference system) is a mapping of how coordinates refer to
precise locations on earth. A CRS is specified by a key-value entry in the
`key_value_metadata` field of `FileMetaData` whose key is a short name of
the CRS and value is the CRS representation. An additional entry in the
`key_value_metadata` field with the suffix ".type" is required to describe
the encoding of this CRS representation.

For example, if a geometry column (e.g., "geom1") uses the CRS "OGC:CRS84", the
writer may write two entries to `key_value_metadata` field of `FileMetaData` as
below, and set the `crs` field of the `GEOMETRY` type to "geom1_crs":
```
"geom1_crs": an UTF-8 encoded PROJJSON representation of OGC:CRS84
"geom1_crs.type": "PROJJSON"
```

The PROJJSON representation of OGC:CRS84 can be seen at [OGC:CRS84][ogc-crs84].
Multiple geometry columns can refer to the same CRS metadata field
(e.g., "geom1_crs") if they share the same CRS.
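
The lookup described above can be sketched as follows. The helper `resolve_crs` is hypothetical, the metadata is modelled as a plain dictionary, and the CRS value is a placeholder string standing in for the full PROJJSON text of OGC:CRS84.

```python
def resolve_crs(crs_key, key_value_metadata):
    """Hypothetical helper: return (representation, encoding) for a GEOMETRY column's crs key."""
    if crs_key is None:
        return None  # crs is optional; nothing was recorded for this column
    try:
        return key_value_metadata[crs_key], key_value_metadata[crs_key + ".type"]
    except KeyError as missing:
        raise ValueError(f"missing key_value_metadata entry: {missing}") from None


# Placeholder value: a real file would store the complete PROJJSON document here.
file_metadata = {
    "geom1_crs": "<PROJJSON text for OGC:CRS84>",
    "geom1_crs.type": "PROJJSON",
}

# Two columns that share a CRS simply carry the same key.
geom1 = resolve_crs("geom1_crs", file_metadata)
geom2 = resolve_crs("geom1_crs", file_metadata)
```

Storing only the key in the logical type keeps the schema small even when the CRS representation is long.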

[ogc-crs84]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details

#### Geometry Statistics

`GeometryStatistics` is a struct to store geometry statistics of a column chunk
of `GEOMETRY` type. It is an optional field of `ColumnMetaData` and contains
[Bounding Box](#bounding-box) and [Geometry Types](#geometry-types).

##### Bounding Box

A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
of each point. A geometry can optionally have Z and / or M values associated
with each point in the geometry.

The Z values introduce the third dimension coordinate. Usually they are used
to indicate the height, or elevation.

M values are an opportunity for a geometry to express a fourth dimension as
a coordinate value. These values can be used as a linear reference value
(e.g., highway milepost value), a timestamp, or some other value as defined
by the CRS.

Bounding box is defined as the thrift struct below in the representation of
min/max value pair of coordinates from each axis. Note that X and Y Values
are always present. Z and M are omitted for 2D geometries.

```thrift
struct BoundingBox {
/** Min X value when edges = PLANAR, westmost value if edges = SPHERICAL */
1: required double xmin;
/** Max X value when edges = PLANAR, eastmost value if edges = SPHERICAL */
2: required double xmax;
/** Min Y value when edges = PLANAR, southmost value if edges = SPHERICAL */
3: required double ymin;
/** Max Y value when edges = PLANAR, northmost value if edges = SPHERICAL */
4: required double ymax;
/** Min Z value if the axis exists */
5: optional double zmin;
/** Max Z value if the axis exists */
6: optional double zmax;
/** Min M value if the axis exists */
7: optional double mmin;
/** Max M value if the axis exists */
8: optional double mmax;
}
```

The meaning of each value depends on the `Edges` attribute of the `GEOMETRY` type:
- If Edges is `PLANAR`, the values are literally the actual min/max value from each axis.
- If Edges is `SPHERICAL`, the values for X and Y are `[westmost, eastmost, southmost, northmost]`,
with necessary min/max values for Z and M if needed.
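
For the `PLANAR` case, a writer could derive the box as in the minimal sketch below. The dataclass reuses the field names of the Thrift struct, the function name `planar_bbox` is hypothetical, and M values are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class BoundingBox:
    """Reuses the field names of the BoundingBox Thrift struct (M omitted in this sketch)."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    zmin: Optional[float] = None
    zmax: Optional[float] = None


def planar_bbox(coords: Iterable[Sequence[float]]) -> BoundingBox:
    """Per-axis min/max over (x, y) or (x, y, z) tuples; assumes edges = PLANAR."""
    pts = list(coords)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    zs = [p[2] for p in pts if len(p) > 2]  # empty for purely 2D input
    return BoundingBox(
        min(xs), max(xs), min(ys), max(ys),
        min(zs) if zs else None, max(zs) if zs else None,
    )
```

This vertex-only approach does not carry over to `SPHERICAL` edges, because a great-circle arc between two vertices can reach latitudes beyond either endpoint.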

##### Geometry Types

A list of geometry types from all geometries in the `GEOMETRY` column, or an
empty list if they are not known.

This is borrowed from [geometry_types of GeoParquet][geometry-types]
except that values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
Table below shows the most common geometry types and their codes:

| Type | XY | XYZ | XYM | XYZM |
| :----------------- | :--- | :--- | :--- | :--- |
| Point | 0001 | 1001 | 2001 | 3001 |
| LineString | 0002 | 1002 | 2002 | 3002 |
| Polygon | 0003 | 1003 | 2003 | 3003 |
| MultiPoint | 0004 | 1004 | 2004 | 3004 |
| MultiLineString | 0005 | 1005 | 2005 | 3005 |
| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
| GeometryCollection | 0007 | 1007 | 2007 | 3007 |

In addition, the following rules are applied:
- A list of multiple values indicates that multiple geometry types are present (e.g. `[0003, 0006]`).
- An empty array explicitly signals that the geometry types are not known.
- The geometry types in the list must be unique (e.g. `[0001, 0001]` is not valid).
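
The codes in the table follow a simple pattern: the base type code plus 1000 for Z, 2000 for M, or 3000 for ZM. The sketch below computes a code and checks the uniqueness rule; the names `wkb_type_code` and `validate_geometry_types` are hypothetical. Note that the leading zeros in the table are only for display, since the values are stored as plain integers.

```python
BASE_CODES = {
    "Point": 1, "LineString": 2, "Polygon": 3, "MultiPoint": 4,
    "MultiLineString": 5, "MultiPolygon": 6, "GeometryCollection": 7,
}
DIMENSION_OFFSETS = {"XY": 0, "XYZ": 1000, "XYM": 2000, "XYZM": 3000}


def wkb_type_code(geometry_type: str, dimension: str = "XY") -> int:
    """ISO WKB integer code for a geometry type and coordinate dimension."""
    return BASE_CODES[geometry_type] + DIMENSION_OFFSETS[dimension]


def validate_geometry_types(codes: list) -> None:
    """Reject duplicate codes; an empty list is allowed and means 'not known'."""
    if len(set(codes)) != len(codes):
        raise ValueError("geometry type codes must be unique")


# A column holding both Polygon and MultiPolygon geometries in XY.
codes = [wkb_type_code("Polygon"), wkb_type_code("MultiPolygon")]
validate_geometry_types(codes)
```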

[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary

#### Geospatial Notes

The Geometry class hierarchy and its WKT and WKB serializations (ISO supporting
XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for
Geographic information – Simple feature access – Part 1: Common architecture](
https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial
Consortium)](https://www.ogc.org/standard/sfa/).

The version of the OGC standard first used here is 1.2.1, but future versions
may also be used if the WKB representation remains wire-compatible.

## UNKNOWN (always null)

Sometimes, when discovering the schema of existing data, values are always null
70 changes: 70 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -237,6 +237,37 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}

/**
* Bounding box of geometries in the representation of min/max value pair of
* coordinates from each axis.
*/
struct BoundingBox {
/** Min X value when edges = PLANAR, westmost value if edges = SPHERICAL */
1: required double xmin;
/** Max X value when edges = PLANAR, eastmost value if edges = SPHERICAL */
2: required double xmax;
/** Min Y value when edges = PLANAR, southmost value if edges = SPHERICAL */
3: required double ymin;
/** Max Y value when edges = PLANAR, northmost value if edges = SPHERICAL */
4: required double ymax;
/** Min Z value if the axis exists */
5: optional double zmin;
/** Max Z value if the axis exists */
6: optional double zmax;
/** Min M value if the axis exists */
7: optional double mmin;
/** Max M value if the axis exists */
8: optional double mmax;
}

/** Statistics specific to GEOMETRY logical type */
struct GeometryStatistics {
/** A bounding box of geometries */
1: optional BoundingBox bbox;
/** Geometry type codes of all geometries, or an empty list if not known */
2: optional list<i32> geometry_types;
}

/**
* Statistics per row group and per page
* All fields are optional.
@@ -380,6 +411,40 @@ struct JsonType {
struct BsonType {
}

/** Physical type and encoding for the geometry type */
enum GeometryEncoding {
/**
* Allowed for physical type: BYTE_ARRAY.
*
* Well-known binary (WKB) representations of geometries.
*/
WKB = 0;
}

/** Interpretation for edges of elements of a GEOMETRY type */
enum Edges {
PLANAR = 0;
SPHERICAL = 1;
}

/**
* GEOMETRY logical type annotation (added in 2.11.0)
*
* GeometryEncoding and Edges are required. In order to correctly interpret
* geometry data, writer implementations SHOULD always set them, and reader
* implementations SHOULD fail for unknown values.
*
* CRS is optional. Once CRS is set, it MUST be a key to an entry in the
* `key_value_metadata` field of `FileMetaData`.
*
* See LogicalTypes.md for detail.
*/
struct GeometryType {
1: required GeometryEncoding encoding;
2: required Edges edges;
3: optional string crs;
}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -410,6 +475,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: GeometryType GEOMETRY // no compatible ConvertedType
}

/**
@@ -850,6 +916,9 @@ struct ColumnMetaData {
* filter pushdown.
*/
16: optional SizeStatistics size_statistics;

/** Optional statistics specific to GEOMETRY logical type */
17: optional GeometryStatistics geometry_stats;
}

struct EncryptionWithFooterKey {
@@ -980,6 +1049,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* GEOMETRY - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true