Partial data #467

Merged
merged 62 commits into from
Jun 16, 2023
Changes from 57 commits
70f6d9b
Add header for parial data appendix
rartino Jun 8, 2023
9a5a36f
First paragraph of partial data appendix
rartino Jun 8, 2023
0680b2f
Adding a JSON-API response example and to partial data examples.
sauliusg Jun 8, 2023
4063e13
Updating the partial response examples.
sauliusg Jun 8, 2023
1a7230f
A format of partial data URLs agreed with Giovanni.
sauliusg Jun 8, 2023
14b9e9d
Removing scaffold comments.
sauliusg Jun 8, 2023
104ed78
Fixinhg the formatting: removing trailing blanks, unfolding text lines.
sauliusg Jun 8, 2023
9fa3b29
Updating the partial data examples to be consistent with the new
sauliusg Jun 8, 2023
6ec5a11
Checking spelling, updating the ".words.lst" file.
sauliusg Jun 8, 2023
92e7c12
Full text of partial data format appendix
rartino Jun 8, 2023
cc3da46
Merge branch 'partial_data' of https://github.com/rartino/OPTIMADE in…
sauliusg Jun 8, 2023
7a1f684
Slight changes in the text.
sauliusg Jun 8, 2023
0245999
Apply suggestions from review
rartino Jun 8, 2023
06d6444
Apply suggestions from review
rartino Jun 8, 2023
3e5fa16
Delete trailing whitespace
rartino Jun 8, 2023
7eddd27
Fix descriptio of the data -> meta fields in the JSON response format
rartino Jun 8, 2023
3e1f04c
Fixing the "next" link definition.
sauliusg Jun 9, 2023
05eacc2
Update optimade.rst
rartino Jun 10, 2023
beaaeef
Apply suggestions from review
rartino Jun 10, 2023
50c355e
Update based on review
rartino Jun 10, 2023
7a92260
Revert unneseccary change to .words.lst
rartino Jun 11, 2023
8f4db09
Apply suggestions from review
rartino Jun 12, 2023
16d60f6
Slightly change the format of the markers
rartino Jun 12, 2023
e109706
Improve clarity for when number of lines does not match response_range
rartino Jun 12, 2023
34bdf2a
Remove trailing whitespace
rartino Jun 12, 2023
961f5b7
Apply suggestions from review
rartino Jun 14, 2023
874bd52
Apply suggestions from review
rartino Jun 15, 2023
7b314af
Add a key to the header to identify the format as OPTIMADE partial data
rartino Jun 15, 2023
6faf8db
Remove trailing whitespace
rartino Jun 15, 2023
316df78
Clarify handling of missing items in partial data
rartino Jun 15, 2023
b080cf2
Change markers to be more detectable in stream
rartino Jun 15, 2023
bd93804
Change markers to be more detectable in stream
rartino Jun 15, 2023
10bc845
Change markers to be more detectable in stream
rartino Jun 15, 2023
39d9ae5
Change format to representation to avoid a clash in terms and fieldnames
rartino Jun 15, 2023
2a24c1a
Enable for efficient parsing of responses a server knows has no refer…
rartino Jun 15, 2023
9d9e26e
Change format to representation to avoid a clash in terms and fieldnames
rartino Jun 15, 2023
ff5a27c
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
8ae1928
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
d8a11cb
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
11900c5
Rename partial_data_url and url to link to better conform to JSON API…
rartino Jun 15, 2023
1b4093e
Remove trailing whitespace
rartino Jun 15, 2023
496b6ca
Change representation to layout to not confuse with URL representatio…
rartino Jun 15, 2023
4d906a2
Remove accidental leftover text.
rartino Jun 15, 2023
b6ab3ae
Fix segment incorrectly placed
rartino Jun 15, 2023
ee4c1e3
Fix braces in partial data examples
rartino Jun 15, 2023
1b0d1a6
Make returned_range RECOMMENDED and move a sentence that had ended up…
rartino Jun 15, 2023
1b9c607
Fix whitespace
rartino Jun 15, 2023
562d651
Improve formulation about partial data URLs
rartino Jun 15, 2023
498d169
Slightly adjust wording
rartino Jun 15, 2023
e5e6046
Slightly adjust wording
rartino Jun 15, 2023
4906c4f
Slightly adjust wording
rartino Jun 15, 2023
864450d
Slightly adjust wording
rartino Jun 15, 2023
e574106
Minor reformulations
rartino Jun 15, 2023
336ef21
Minor reformulations
rartino Jun 15, 2023
93ee583
Rearrange some text to be more logical
rartino Jun 15, 2023
edf4f25
Clarify optimade-partial-data/format field futureproofing
rartino Jun 15, 2023
5b13315
Minor reformulations and adjustments
rartino Jun 15, 2023
2cfe8c0
Allow an inline item_schema in addition to the link
rartino Jun 15, 2023
4e9fb4d
Fix missing quotation marks
rartino Jun 15, 2023
b50d93d
Minor language corrections from review
rartino Jun 16, 2023
dfc24d4
Add sentence about implementations decision on what is partial data
rartino Jun 16, 2023
a0aa533
Merge branch 'develop' into partial_data
rartino Jun 16, 2023
6 changes: 5 additions & 1 deletion .words.lst
@@ -86,6 +86,7 @@ bandgap
bd
booktitle
boolean
bzip
calc
cartesian
checksums
@@ -115,18 +116,21 @@ exclusiveMinimum
exmpl
fieldname
firstname
hdf
howpublished
href
html
http
hydrogens
hydroperoxide
implementers
incrementing
internaldb
javascript
json
jsonapi
jsonc
jsonlines
kvak
lastname
libc
@@ -203,4 +207,4 @@ xy
yacc
zeo
zeolites
�ngstr�m
ångström
251 changes: 250 additions & 1 deletion optimade.rst
@@ -442,6 +442,55 @@ For example, the following query can be sent to API implementations `exmpl1` and

:filter:`filter=_exmpl1_band_gap<2.0 OR _exmpl2_band_gap<2.5`

Transmission of large property values
-------------------------------------

A property value may be too large to fit in a single response.
OPTIMADE provides a mechanism for a client to handle such properties by fetching them in a separate series of requests.

In this case, the response to the initial query gives the value :val:`null` for the property.
A list of one or more data URLs, together with their respective partial data formats, is given in the response.
How this list is provided is response format-dependent.
For the JSON response format, see the description of the :field:`partial_data_links` field, nested under :field:`data` and then :field:`meta`, in the section `JSON Response Schema: Common Fields`_.

The default partial data format is named "jsonlines" and is described in the Appendix `OPTIMADE JSON lines partial data format`_.
An implementation SHOULD always include this format as one of the alternative partial data formats provided for a property that has been omitted from the response to the initial query.
Implementations MAY provide links to their own non-standard formats, but non-standard format names MUST be prefixed by a database-provider-specific prefix.

Below follows an example of the :field:`data` and :field:`meta` parts of a response using the JSON response format that communicates that the property value has been omitted from the response, with three different links for different partial data formats provided.

.. code:: jsonc

    {
      // ...
      "data": {
        "type": "structures",
        "id": "2345678",
        "attributes": {
          "a": null
        },
        "meta": {
          "partial_data_links": {
            "a": [
              {
                "format": "jsonlines",
                "link": "https://example.org/optimade/v1.2/extensions/partial_data/structures/2345678/a/default_format"
              },
              {
                "format": "_exmpl_bzip2_jsonlines",
                "link": "https://db.example.org/assets/partial_values/structures/2345678/a/bzip2_format"
              },
              {
                "format": "_exmpl_hdf5",
                "link": "https://cloud.example.org/ACCHSORJGIHWOSJZG"
              }
            ]
          }
        }
      }
      // ...
    }
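
As an illustration of how a client might consume :field:`partial_data_links`, the following Python sketch picks the link for the first acceptable format offered for a field.
The function name :val:`pick_partial_data_link` and the :val:`preferred` parameter are hypothetical and not part of the specification; the resource object mirrors the example above, trimmed to two links.

```python
from typing import Optional, Sequence


def pick_partial_data_link(
    resource: dict, field: str, preferred: Sequence[str] = ("jsonlines",)
) -> Optional[str]:
    """Return the link for the first preferred format offered for `field`, or None."""
    offered = resource.get("meta", {}).get("partial_data_links", {}).get(field, [])
    by_format = {item["format"]: item["link"] for item in offered}
    for name in preferred:
        if name in by_format:
            return by_format[name]
    return None


# Resource object mirroring the example response (hypothetical client-side copy).
resource = {
    "type": "structures",
    "id": "2345678",
    "attributes": {"a": None},
    "meta": {
        "partial_data_links": {
            "a": [
                {
                    "format": "jsonlines",
                    "link": "https://example.org/optimade/v1.2/extensions/partial_data/structures/2345678/a/default_format",
                },
                {
                    "format": "_exmpl_hdf5",
                    "link": "https://cloud.example.org/ACCHSORJGIHWOSJZG",
                },
            ]
        }
    },
}

link = pick_partial_data_link(resource, "a")
```

A client that understands a provider-specific format could pass it first in :val:`preferred` and keep "jsonlines" as the fallback.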

Responses
=========

@@ -593,6 +642,22 @@ Every response SHOULD contain the following fields, and MUST contain at least :f
- **data**: The schema of this value varies by endpoint, it can be either a *single* `JSON API resource object <http://jsonapi.org/format/1.0/#document-resource-objects>`__ or a *list* of JSON API resource objects.
Every resource object needs the :field:`type` and :field:`id` fields, and its attributes (described in section `API Endpoints`_) need to be in a dictionary corresponding to the :field:`attributes` field.

Every resource object MAY also contain a :field:`meta` field with the following keys:

- **partial_data_links**: an object used to list links which can be used to fetch data that has been omitted from the :field:`data` part of the response.
The keys are the names of the fields in :field:`attributes` for which partial data links are available.
Each value is a list of items that MUST have the following keys:

- **format**: String.
The name of the format provided via this link.
One of the items SHOULD have the format "jsonlines", which refers to the format in `OPTIMADE JSON lines partial data format`_.

- **link**: String.
A `JSON API link <http://jsonapi.org/format/1.0/#document-links>`__ that points to a location from which the omitted data can be fetched.
There is no requirement on the syntax or format for the link URL.

For more information about the mechanism to transmit large property values, including an example of the format of :field:`partial_data_links`, see `Transmission of large property values`_.

The response MAY also return resources related to the primary data in the field:

- **links**: `JSON API links <http://jsonapi.org/format/1.0/#document-links>`__ is REQUIRED for implementing pagination.
@@ -915,7 +980,8 @@ OPTIONALLY it can also contain the following fields:

- **self**: the entry's URL

- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that contains non-standard meta-information about the object.
- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that is used to communicate metadata.
See `JSON Response Schema: Common Fields`_ for more information about this field.
Comment on lines +984 to +985
Contributor

Should this not come from PR #463?

Suggested change
- **meta**: a `JSON API meta object <https://jsonapi.org/format/1.0/#document-meta>`__ that is used to communicate metadata.
See `JSON Response Schema: Common Fields`_ for more information about this field.

Contributor Author

I think we need to define the meta field here to hold the "partial_data_links" key? Otherwise this PR would be inconsistent.

Contributor

I think I came across another commit which removed part of the definition of the metadata fields. So it looked like you forgot this piece, which is why I mentioned it.
Either both should be in or both should be out.

Contributor Author

Earlier I indeed removed a segment here that defined the property_metadata subkey of meta, which I agree belong better in #463. But, the segment you have marked now defines the meta superkey we need for the partial_data_links subkey.

I'm confused over what you are asking for. Are you saying we absolutely should not mention meta with a link to 'JSON Response Schema: Common Fields' that defines meta -> partial_data_links; despite that with this PR that key is an absolutely vital part of the 'Entry Listing JSON Response Schema'?


- **relationships**: a dictionary containing references to other entries according to the description in section `Relationships`_ encoded as `JSON API Relationships <https://jsonapi.org/format/1.0/#document-resource-object-relationships>`__.
The OPTIONAL human-readable description of the relationship MAY be provided in the :field:`description` field inside the :field:`meta` dictionary of the JSON API resource identifier object.
@@ -3421,3 +3487,186 @@ The strings below contain Extended Regular Expressions (EREs) to recognize ident
#BEGIN ERE strings
"([^\"]|\\.)*"
#END ERE strings

OPTIMADE JSON lines partial data format
---------------------------------------
The OPTIMADE JSON lines partial data format is a lightweight format for transmitting property data that are too large to fit in a single OPTIMADE response.
The format is based on `JSON Lines <https://jsonlines.org/>`__, which enables streaming of JSON data.
Note: since the below definition references both JSON fields and OPTIMADE properties, the data type names depend on context: for JSON they are, e.g., "array" and "object" and for OPTIMADE properties they are, e.g., "list" and "dictionary".

.. _slice object:

To aid the definition of the format below, we first define a "slice object" to be a JSON object describing slices of arrays.
The object has the following OPTIONAL fields:

- :field:`"start"`: Integer.
The slice starts at the value with the given index (inclusive).
The default is 0, i.e., the value at the start of the array.
- :field:`"stop"`: Integer.
The slice ends at the value with the given index (inclusive).
If omitted, the end of the slice is the last index of the array.
- :field:`"step"`: Integer.
The absolute difference in index between two subsequent values that are included in the slice.
The default is 1, i.e., every value in the range indicated by :field:`start` and :field:`stop` is included in the slice.
Hence, a value of 2 denotes a slice of every second value in the array.

For example, for the array :val:`["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]` the slice object :val:`{"start": 1, "stop": 7, "step": 3}` refers to the items :val:`["b", "e", "h"]`.
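
The slice semantics can be sketched in Python as follows.
This is a minimal illustration, not part of the specification: the helper name :val:`slice_indices` is hypothetical, and the sketch assumes that the given indices lie within the array.
Note that :field:`stop` is inclusive, unlike the end of a Python :val:`range`.

```python
from typing import List


def slice_indices(slice_obj: dict, length: int) -> List[int]:
    """Expand a slice object into the zero-based indices it selects.

    `length` is the length of the array being sliced; it is only needed
    for the default value of "stop" (the last index of the array).
    """
    start = slice_obj.get("start", 0)
    stop = slice_obj.get("stop", length - 1)  # inclusive end of the slice
    step = slice_obj.get("step", 1)
    return list(range(start, stop + 1, step))


letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
indices = slice_indices({"start": 1, "stop": 7, "step": 3}, len(letters))
selected = [letters[i] for i in indices]
```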

Furthermore, we also define the following special markers:

- The *end-of-data-marker* is this exact JSON: :val:`["PARTIAL-DATA-END", [""]]`.
- A *reference-marker* is this exact JSON: :val:`["PARTIAL-DATA-REF", ["<url>"]]`, where :val:`"<url>"` is to be replaced with a URL being referenced.
A reference-marker MUST only occur in a place where the property being communicated could have an embedded list.
- A *next-marker* is this exact JSON: :val:`["PARTIAL-DATA-NEXT", ["<url>"]]`, where :val:`"<url>"` is to be replaced with the target URL for the next link.

There is no requirement on the syntax or format of the URLs provided in these markers.
When data is fetched from these URLs the response MUST use the JSON lines partial data format, i.e., the markers cannot be used to link to partial data provided in other formats.
The markers have been deliberately designed to be valid JSON but *not* valid OPTIMADE property values.
Since the OPTIMADE list data type is defined as a list of values of the same data type or :val:`null`, the above markers cannot be encountered inside the actual data of an OPTIMADE property.
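
A client could recognize the three markers in parsed JSON values along the following lines.
This is only a sketch: the function name :val:`classify_marker` and the short labels it returns are hypothetical choices made for illustration.

```python
import json
from typing import Optional, Tuple

# Marker strings from the format definition, mapped to hypothetical short labels.
MARKERS = {
    "PARTIAL-DATA-END": "end",
    "PARTIAL-DATA-REF": "ref",
    "PARTIAL-DATA-NEXT": "next",
}


def classify_marker(value) -> Optional[Tuple[str, str]]:
    """Return (label, url) if `value` has the shape of a marker, else None."""
    if (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], str)
        and value[0] in MARKERS
        and isinstance(value[1], list)
        and len(value[1]) == 1
        and isinstance(value[1][0], str)
    ):
        return MARKERS[value[0]], value[1][0]
    return None


end = classify_marker(json.loads('["PARTIAL-DATA-END", [""]]'))
ref = classify_marker(json.loads('["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]'))
plain = classify_marker(json.loads("[10, 20, 21]"))
```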

**Implementation note:** the recognizable string values for the markers should make it possible to prescreen the raw text of the JSON data lines for the reference-marker string to determine which lines can be excluded from further processing to resolve references (alternatively, this screening can be done by the string parser used by the JSON parser).
The underlying design idea is that for lines that have reference-markers, the time it takes to process the data structure to locate the markers should be negligible compared to the time it takes to resolve and handle the large data they reference.
Hence, the most relevant optimization is to avoid spending time processing data structures to find markers for lines where there are none.
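
The prescreening idea can be sketched in Python as follows.
The helper names are hypothetical, and the sketch assumes that the server writes the marker string literally, i.e., without JSON unicode escapes.
A false positive (the text occurring inside ordinary string data) only costs an unnecessary walk of that line.

```python
import json
from typing import Any, List, Optional

REF_TOKEN = '"PARTIAL-DATA-REF"'


def may_contain_reference(raw_line: str) -> bool:
    """Cheap text-level check done before walking the parsed structure."""
    return REF_TOKEN in raw_line


def collect_reference_urls(value: Any, found: Optional[List[str]] = None) -> List[str]:
    """Walk a parsed data line and collect the URLs of all reference-markers."""
    if found is None:
        found = []
    if isinstance(value, list):
        is_marker = (
            len(value) == 2
            and value[0] == "PARTIAL-DATA-REF"
            and isinstance(value[1], list)
            and len(value[1]) == 1
        )
        if is_marker:
            found.append(value[1][0])
        else:
            for item in value:
                collect_reference_urls(item, found)
    elif isinstance(value, dict):
        for item in value.values():
            collect_reference_urls(item, found)
    return found


raw_lines = [
    "[[10,20,21], [30,40,50]]",
    '["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]',
    '[[11, 110], ["PARTIAL-DATA-REF", ["https://example.db.org/value3"]], [550, 333]]',
]
screened = [may_contain_reference(line) for line in raw_lines]
# Only lines that pass the text-level check are walked in full.
urls = [collect_reference_urls(json.loads(line)) for line in raw_lines if may_contain_reference(line)]
```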

The full response MUST be valid `JSON Lines <https://jsonlines.org/>`__ that adheres to the following format:

- The first line is a header object (defined below).
- The following lines are data lines adhering to the formats described below.
- The final line is either an end-of-data-marker (indicating that there is no more data to be given), or a next-marker indicating that more data is available, which can be obtained by retrieving data from the provided URL.
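
The three-part structure can be illustrated with a short Python sketch that splits a complete response into header, data lines, and final marker.
The function name :val:`split_partial_data_response` is hypothetical, and the sketch reads the whole response into memory for simplicity, whereas a real client would likely process the lines as a stream.

```python
import json
from typing import Any, List, Optional, Tuple


def split_partial_data_response(text: str) -> Tuple[dict, List[Any], Optional[str]]:
    """Split a response into (header, data lines, next URL or None)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a header line and a final marker line")
    header = json.loads(lines[0])
    final = json.loads(lines[-1])
    if not (
        isinstance(final, list)
        and len(final) == 2
        and final[0] in ("PARTIAL-DATA-END", "PARTIAL-DATA-NEXT")
    ):
        raise ValueError("the final line is neither an end-of-data-marker nor a next-marker")
    data = [json.loads(line) for line in lines[1:-1]]
    next_url = final[1][0] if final[0] == "PARTIAL-DATA-NEXT" else None
    return header, data, next_url


text = "\n".join(
    [
        '{"optimade-partial-data": {"format": "1.2"}, "layout": "dense"}',
        "123",
        "345",
        '["PARTIAL-DATA-NEXT", ["https://example.db.org/value4"]]',
    ]
)
header, data, next_url = split_partial_data_response(text)
```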

The first line MUST be a JSON object providing header information.
The header object MUST contain the keys:

- :field:`"optimade-partial-data"`: Object.
An object identifying the response as being in the OPTIMADE partial data format.

It MUST contain the following key:

- :field:`"format"`: String.
Specifies the minor version of the partial data format used. The string MUST be of the format "MAJOR.MINOR", referring to the version of the OPTIMADE standard that describes the format. The version number string MUST NOT be prefixed by, e.g., "v". In implementations of the present version of the standard, the value MUST be exactly :val:`1.2`.
A client MUST NOT expect to be able to parse the format if the field is not a string of the format MAJOR.MINOR or if the MAJOR version number is unrecognized.

- :field:`"layout"`: String.
A string either equal to :val:`"dense"` or :val:`"sparse"` to indicate whether the returned format uses a dense or sparse layout.

The following key is RECOMMENDED in the header object:

- :field:`"returned_ranges"`: Array of Object.

For the dense layout, and for the sparse layout of one-dimensional list properties, the array contains a single element which is a `slice object`_ representing the range of data present in the response.
In the specific case of a hierarchy of list properties represented as a sparse multi-dimensional array, if the field :field:`"returned_ranges"` is given, it MUST contain one slice object per dimension of the multi-dimensional array, representing slices for each dimension that cover the data given in the response.

The header object MAY also contain the keys:

- :field:`"property_name"`: String.
The name of the property being provided.

- :field:`"entry"`: Object.
An object that MUST have the following two keys:

- :field:`"id"`: String.
The id of the entry of the property being provided.

- :field:`"type"`: String.
The type of the entry of the property being provided.

- :field:`"has_references"`: Boolean.
An optional boolean to indicate whether any of the data lines in the response contains a reference marker.
A value of :val:`false` means that the client does not have to process any of the lines to detect reference markers, which may speed up the parsing.

- :field:`"links"`: Object.
An object to provide relevant links for the property being provided.
It MAY contain the following key:

- :field:`"base_url"`: String.
The base URL of the implementation serving the database to which this property belongs.

- :field:`"item_schema"`: String.
A URL to a JSON Schema that validates the data lines of the response.
The format SHOULD be the relevant partial extract of a valid property definition as described in `Property Definitions`_.
If a schema is provided, it MUST be a valid JSON schema using the same version of JSON schema as described in that section.
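
A client-side check of the mandatory header keys might look like the following Python sketch.
The function name :val:`check_header` and the constant :val:`SUPPORTED_MAJOR` are hypothetical; the sketch assumes a client that only recognizes major version 1.

```python
import re

SUPPORTED_MAJOR = 1  # assumption: this client only recognizes MAJOR version 1


def check_header(header: dict) -> str:
    """Validate the mandatory header keys and return the layout."""
    ident = header.get("optimade-partial-data")
    if not isinstance(ident, dict):
        raise ValueError("missing 'optimade-partial-data' object")
    fmt = ident.get("format")
    # The version must be a string of the form MAJOR.MINOR, without a "v" prefix.
    if not isinstance(fmt, str) or re.fullmatch(r"\d+\.\d+", fmt) is None:
        raise ValueError("'format' is not a MAJOR.MINOR string")
    if int(fmt.split(".")[0]) != SUPPORTED_MAJOR:
        raise ValueError("unrecognized MAJOR version")
    layout = header.get("layout")
    if layout not in ("dense", "sparse"):
        raise ValueError("'layout' must be 'dense' or 'sparse'")
    return layout


layout = check_header({"optimade-partial-data": {"format": "1.2"}, "layout": "dense"})
```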

The format of data lines of the response (i.e., all lines except the first and the last) depends on whether the header object specifies the layout as :val:`"dense"` or :val:`"sparse"`.

- **Dense layout:** In the dense partial data layout, each data line reproduces one list item in the OPTIMADE list property being transmitted in JSON format.
If OPTIMADE list properties are embedded inside the item, they can either be included in full or replaced with a reference-marker.
If a list is replaced by a reference marker, the client MAY use the provided URL to obtain the list items.
If the field :field:`"returned_ranges"` is omitted, then the client MUST assume that the data is a continuous range of data from the start of the array up to the number of elements given until reaching the end-of-data-marker or next-marker.

- **Sparse layout for one-dimensional list:** When the response sparsely communicates items for a one-dimensional OPTIMADE list property, each data line contains a JSON array of the format:

- The first item of the array is the zero-based index of the list property item being provided by this line.
- The second item of the array is the list property item located at the indicated index, represented using the same format as each line in the dense layout.
In the same way as for the dense layout, reference-markers are allowed inside the item data for embedded lists that do not fit in the response (see example below).

- **Sparse layout for multi-dimensional lists:** the server MAY use a specific sparse layout for the case that the OPTIMADE property represents a series of directly hierarchically embedded lists (i.e., a multidimensional sparse array).
In this case, each data line contains a JSON array of the format:

- All array items except the last one are integer zero-based indices of the list property item being provided by this line; these indices refer to the aggregated dimensions in the order of outermost to innermost.
- The last item of the array is the list property item located at the indicated coordinates, represented using the same format as each line in the dense layout.
In the same way as for the dense layout, reference-markers are allowed inside the item data for embedded lists that do not fit in the response (see example below).

If the final line of the response is a next-marker, the client MAY continue fetching the data for the property by retrieving another partial data response from the provided URL.
If the final line is an end-of-data-marker, any data not covered by any of the responses are to be assigned the value :val:`null`.

If :field:`"returned_ranges"` is included in the response and the client encounters a next-marker before receiving all lines indicated by the slice, it should proceed by not assigning any values to the corresponding items, i.e., this is not an error.
Since the remaining values are not assigned a value, they will be :val:`null` if they are not assigned values by another response retrieved via a next link encountered before the final end-of-data-marker.
(Since there is no requirement that values are assigned in a specific order between responses, it is possible that the omitted values are already assigned.
In that case the values shall remain as assigned, i.e., they are not overwritten by :val:`null` in this situation.)
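
The assignment rules for the dense layout can be sketched in Python as follows.
The helper names and the use of a dictionary keyed by index are hypothetical choices for illustration; the total length of the list is assumed to be known to the client.
Items are only ever written when a data line provides them, so items assigned by one response are not overwritten by another response that omits them, and items never assigned come out as :val:`None` (i.e., :val:`null`).

```python
from typing import Any, Dict, List, Optional


def assign_dense(
    values: Dict[int, Any], data_lines: List[Any], returned_range: Optional[dict] = None
) -> None:
    """Assign the data lines of one dense response to their list indices."""
    rng = returned_range or {}
    start = rng.get("start", 0)
    step = rng.get("step", 1)
    # Fewer lines than the range indicates is not an error: the remaining
    # indices are simply left untouched.
    for offset, item in enumerate(data_lines):
        values[start + offset * step] = item


def to_list(values: Dict[int, Any], length: int) -> List[Any]:
    """Build the final list; indices never assigned become None."""
    return [values.get(i) for i in range(length)]


values: Dict[int, Any] = {}
# First response: declares indices 10..20 in steps of 2, but provides only two lines.
assign_dense(values, ["x", "y"], {"start": 10, "stop": 20, "step": 2})
# A later response, reached via a next-marker, provides index 16.
assign_dense(values, ["z"], {"start": 16, "stop": 20, "step": 2})
result = to_list(values, 21)
```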

Examples
~~~~~~~~

Below follows an example of a dense response for a partial array of numeric values.
The response returns the first three items of the range given in :field:`"returned_ranges"` (indices 10, 12, and 14 of the original list) and provides a next-marker link to continue fetching data:

.. code:: json

    {"optimade-partial-data": {"format": "1.2"}, "layout": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}
    123
    345
    -12.6
    ["PARTIAL-DATA-NEXT", ["https://example.db.org/value4"]]

Below follows an example of a dense response for a list property as a partial array of multidimensional array values.
The item with index 10 in the original list is provided explicitly in the response and is the first one provided in the response since start=10.
The item with index 12 in the list, the second data item provided since start=10 and step=2, is not included, only referenced.
The third provided item (index 14 in the original list) is only partially returned: it is a list of three items, the first and last are explicitly provided, the second one is only referenced.

.. code:: json

    {"optimade-partial-data": {"format": "1.2"}, "layout": "dense", "returned_ranges": [{"start": 10, "stop": 20, "step": 2}]}
    [[10,20,21], [30,40,50]]
    ["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]
    [[11, 110], ["PARTIAL-DATA-REF", ["https://example.db.org/value3"]], [550, 333]]
    ["PARTIAL-DATA-NEXT", ["https://example.db.org/value4"]]

Below follows an example of the sparse layout for multi-dimensional lists with three aggregated dimensions.
The underlying property value can be taken to be sparse data in lists in four dimensions of 10000 x 10000 x 10000 x N, where the innermost list is a non-sparse list of numbers of arbitrary length.
The only non-null items in the outer three dimensions are, say, [3,5,19], [30,15,9], and [42,54,17].
The response below communicates the first item explicitly; the second one by deferring the innermost list using a reference-marker; and the third item is not included in this response, but deferred to another page via a next-marker.

.. code:: json

    {"optimade-partial-data": {"format": "1.2"}, "layout": "sparse"}
    [3,5,19, [10,20,21,30]]
    [30,15,9, ["PARTIAL-DATA-REF", ["https://example.db.org/value1"]]]
    ["PARTIAL-DATA-NEXT", ["https://example.db.org/"]]

An example of the sparse layout for multi-dimensional lists with three aggregated dimensions and integer values:

.. code:: json

    {"optimade-partial-data": {"format": "1.2"}, "layout": "sparse"}
    [3,5,19, 10]
    [30,15,9, 31]
    ["PARTIAL-DATA-NEXT", ["https://example.db.org/"]]

An example of the sparse layout for multi-dimensional lists with three aggregated dimensions and values that are multidimensional lists of integers of arbitrary lengths:

.. code:: json

    {"optimade-partial-data": {"format": "1.2"}, "layout": "sparse"}
    [3,5,19, [ [10,20,21], [30,40,50] ] ]
    [3,7,19, ["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]]
    [4,5,19, [ [11, 110], ["PARTIAL-DATA-REF", ["https://example.db.org/value3"]], [550, 333]]]
    ["PARTIAL-DATA-END", [""]]
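
To illustrate how the data lines of the sparse multi-dimensional layout might be decoded, the following Python sketch separates the leading indices from the trailing item of each line.
The function name :val:`decode_sparse_lines` is hypothetical; the header line and the final marker line are assumed to have been removed before the call, and reference-markers are left unresolved.

```python
import json
from typing import Any, Dict, List, Tuple


def decode_sparse_lines(raw_lines: List[str]) -> Dict[Tuple[int, ...], Any]:
    """Map the coordinates given on each data line to the item provided there."""
    items: Dict[Tuple[int, ...], Any] = {}
    for raw in raw_lines:
        # All array items except the last are indices; the last is the item.
        *indices, item = json.loads(raw)
        items[tuple(indices)] = item
    return items


raw_lines = [
    "[3,5,19, [ [10,20,21], [30,40,50] ] ]",
    '[3,7,19, ["PARTIAL-DATA-REF", ["https://example.db.org/value2"]]]',
]
items = decode_sparse_lines(raw_lines)
```

Coordinates that never appear in any response are not stored at all, which matches the rule that data not covered by the responses take the value :val:`null`.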