Scope of JSON Schema Validation

Austin Wright edited this page Feb 24, 2020 · 12 revisions

This page fleshes out some of the considerations on what is in scope for JSON Schema Validation, and relationships to other features of JSON Schema.

Validation vs. linting

Depending on the kind of document language, a variety of passes may need to be made over the document in order to parse it for information.

More complicated parsers may have distinct passes, such as tokenizing and lexing. Simpler parsers are a mere state machine that consumes each byte of the document in sequence and updates its state, both to record or emit the data encountered so far and to prepare for the next character. JSON can be read with a mere state machine, requiring no lookahead and no ambiguous states.

If the parser encounters no errors (that is, no illegal characters), the document is considered well formed.

However, there may be multiple ways to encode the same data, from the standpoint of the application. JavaScript and JSON, for example, can encode strings in several ways: "A" and "\u0041" form the same string, and there is no difference to the running application (unless the application is introspecting its own source code, which is not recommended). If the application developer wishes to express a preference for one or the other, this may be tested with a linter.

It is entirely the intention of JSON to allow multiple different ways of encoding the same data, and for these different encodings to be processed by the recipient application exactly the same way.
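
As a minimal sketch of this point, the following Python snippet (using only the standard json module; the variable names are illustrative) parses both spellings and compares the results. The raw texts differ, which is the only thing a linter could inspect, while the parsed values are identical.

```python
import json

# Two different JSON texts for the same one-character string.
plain = '"A"'
escaped = '"\\u0041"'  # the JSON text "\u0041", including the quotes

# The raw texts differ; this is the level a linter works at.
texts_differ = plain != escaped

# After parsing, the application receives identical data.
same_data = json.loads(plain) == json.loads(escaped)
```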

In some cases, this is necessary for application security. For example, you can backslash escape the slash character. This allows you to safely embed JSON inside text/html, by making it impossible to write a sequence that looks like a closing tag such as </script>—since it would be encoded as <\/script>.
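
The slash escape can be sketched in Python as follows. The helper name dumps_for_script_tag is hypothetical, and the sketch covers only the "</" sequence discussed above; it does not claim to address every hazard of embedding data in HTML. Python's json.dumps does not emit "\/" on its own, so the escape is applied to the serialized text afterwards.

```python
import json

def dumps_for_script_tag(value):
    """Serialize value as JSON, writing every "</" as "<\\/"."""
    # In JSON output, "<" can only appear inside a string literal,
    # and "\/" is a legal escape for "/" there, so the data is unchanged.
    return json.dumps(value).replace("</", "<\\/")

payload = {"comment": "</script><script>alert(1)</script>"}
text = dumps_for_script_tag(payload)

# A parser reading the escaped text recovers exactly the original data.
round_tripped = json.loads(text)
```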

JSON Schema does not offer a way to distinguish or require one form or the other, since as far as the application is concerned the data is the same. If you need to test how JSON is written (for example, to ensure that output is obfuscated so that it does not look like valid HTML), this is instead a task for a linter.

Types of Validation

Validation ensures that the document (including its data) is in a form suitable for consumption by the application. It can be broken down into two broad categories:

Syntactic validation ensures that the document parser can reach the end of the document having fully and unambiguously understood its syntax. A document that conforms to the generic JSON syntax is considered well formed.

Semantic validation ensures that the data within the document is within the boundaries that the application will understand. A document whose data meets the desired requirements is considered valid.

Since JSON Schema only operates on a well-formed JSON document, syntactic validation is implicitly part of JSON Schema validation.
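
The two categories can be illustrated with a small Python sketch. The function classify and the rule that "age" must be a non-negative integer are assumptions made up for this example; the semantic step is hand-written here rather than driven by a schema.

```python
import json

def classify(text):
    """Return 'malformed', 'well-formed but invalid', or 'valid'."""
    try:
        data = json.loads(text)  # syntactic validation
    except json.JSONDecodeError:
        return "malformed"
    # Semantic validation against a hypothetical application rule:
    # the document is an object whose "age" is a non-negative integer.
    age = data.get("age") if isinstance(data, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, int) and not isinstance(age, bool) and age >= 0:
        return "valid"
    return "well-formed but invalid"
```

The semantic check is only reachable once parsing has succeeded, which is the sense in which syntactic validation is implied.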

Overlap of well-formed and valid

If you have many constraints on data, and your application requires a valid document (and not merely a well formed one), then it is sometimes possible to write the parser to do validation during parsing.

For example, a "coordinate" property may require an array with exactly two numbers, representing longitude and latitude. In this case, a JSON Schema-aware parser may reject a string value with a message such as "Unexpected double quote on line 2 character 20; expecting an array", as if it were an illegal character.

In this case, there is little runtime distinction between being well-formed and valid. The distinction that remains is academic: the parser must still be a full JSON parser that can accommodate object keys appearing in any order, but it can terminate parsing as soon as it knows the output will not be valid.
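
A rough approximation of this idea is possible with Python's standard json module, whose object_pairs_hook parameter is called each time an object's closing brace is reached. The sketch below is coarser than the character-level rejection described above: it fires when the enclosing object ends, not at the offending double quote. CoordinateError and the helper names are hypothetical.

```python
import json

class CoordinateError(Exception):
    """Raised when a "coordinate" property is not an array of two numbers."""

def _is_number(value):
    # bool is a subclass of int in Python, but true/false are not JSON numbers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def _check_object(pairs):
    # Invoked by the json module once per object, as soon as that object is complete.
    for key, value in pairs:
        if key == "coordinate":
            ok = (isinstance(value, list) and len(value) == 2
                  and all(_is_number(v) for v in value))
            if not ok:
                raise CoordinateError(
                    f"expected [longitude, latitude], got {value!r}")
    return dict(pairs)

def parse_with_coordinate_check(text):
    """Parse JSON text, aborting at the first object with a bad coordinate."""
    return json.loads(text, object_pairs_hook=_check_object)
```

Given an input such as '{"place": {"coordinate": "51.5,-0.1"}, "rest": [1, 2' the hook raises CoordinateError when the inner object closes, before the parser ever reaches the truncated array at the end.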

Types of Semantic Validation

Structural

In short, structural validation is concerned with describing the outermost limits of what a document must conform to, including values that might become sensible in the future. For example, Elbonia is not a country right now, but it's feasible that a country named Elbonia could form some time in the future. Therefore, it would not make sense to hard-code a list of countries as part of structural validation.

Formally, structural validation is concerned with placing assertions on a single value. For example, "Value is greater than zero", "Value is an array", "Value is nonempty", or "Value has property with key name".
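
The four example assertions can be written as plain Python predicates, each looking at one value and nothing else. These functions are illustrative only; they are not JSON Schema keywords, and their names are made up for this sketch.

```python
def is_number(value):
    # bool is a subclass of int in Python; JSON true/false are not numbers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def greater_than_zero(value):
    """'Value is greater than zero'."""
    return is_number(value) and value > 0

def is_array(value):
    """'Value is an array' (a parsed JSON array is a Python list)."""
    return isinstance(value, list)

def is_nonempty(value):
    """'Value is nonempty', for strings, arrays, and objects."""
    return isinstance(value, (str, list, dict)) and len(value) > 0

def has_property(value, name):
    """'Value has property with key name'."""
    return isinstance(value, dict) and name in value
```

None of the predicates needs access to any other part of the document, a database, or the network, which is what makes them structural.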

Consistency

Consistency validation ensures that references between data make sense. For example, "Value is an ID string found in the users database", "Value A is less than value B", or "Value is an ID number that must be distinct from other ID numbers".

There are many ways to verify data consistency, far too many to incorporate into JSON Schema. Validating data consistency may involve:

  • Scanning the rest of the document for a referenced value
  • Making a network request for a referenced document
  • Querying a database for a record with the given value
  • Sending a message to the referenced address, and verifying it was received

In short, data consistency is tested by actually exercising the intended action. For looking up a user, you must actually query the users database. For testing an email address, you must actually send an email.
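
A minimal sketch of the three consistency examples follows. The document layout (user_id, start, end, items) and the function names are assumptions for illustration, and an in-memory set stands in for the users database; in a real application user_exists would perform the actual query.

```python
def check_consistency(document, user_exists):
    """Return a list of consistency problems found in document."""
    errors = []
    # "Value is an ID string found in the users database"
    if not user_exists(document["user_id"]):
        errors.append("unknown user")
    # "Value A is less than value B"
    if not document["start"] < document["end"]:
        errors.append("start must be less than end")
    # "Value is an ID number that must be distinct from other ID numbers"
    ids = [item["id"] for item in document["items"]]
    if len(ids) != len(set(ids)):
        errors.append("duplicate item id")
    return errors

# Stand-in for the users database (an assumption for this sketch).
known_users = {"u1", "u2"}

def user_exists(user_id):
    return user_id in known_users
```

Each check relates one value to something outside itself: another value, the rest of the document, or an external store.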

Inference

The primary job of an inference engine is to determine additional data about a resource based on what is known. Inference typically has some way to perform validation, which checks that there are no contradictions between pieces of data. For example, it can assert that a resource cannot be both a Cat and a Dog.

This is the type of validation used by RDF Schema and OWL.

I/O-dependent tests vs. compute-heavy tests

An important performance characteristic of JSON Schema validation is that it can be performed without relying on blocking I/O operations (such as network or filesystem access). Excluding data consistency tests from the scope of JSON Schema validation has the effect of excluding tests that rely on I/O, such as tests that have to check a database to verify that a referenced ID exists.

Not all data consistency tests rely on I/O, however. Many applications simply wish to test that an ID is defined in another part of the same document. Even though this case would be compute-bound, it is still outside the scope of JSON Schema validation for several reasons:

  1. JSON Schema validation does not assume that the entire JSON document must be buffered in memory. JSON Schema validation is designed to be compatible with streaming parsers that retain only a limited amount of state.

  2. Supporting all the different ways that people can implement same-document references would still be extremely complicated. This is functionality best left to an actual scripting language (like Lua or ECMAScript), rather than re-designing a new language in JSON.

  3. JSON Schema should not express a preference on the best way to validate references to data. The mere existence of a keyword may encourage people to change the structure of their JSON documents to be able to use it. It would be unfortunate if an application started embedding more data into JSON documents because JSON Schema only supported compute tests and not I/O-dependent tests.
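
For comparison, a same-document reference check takes only a few lines in a general-purpose language. The sketch below assumes a hypothetical document shape with "nodes" that define IDs and "edges" that refer to them; a different application would need different code, which is the difficulty described in point 2.

```python
def dangling_references(document):
    """Return the sorted IDs that edges refer to but no node defines."""
    defined = {node["id"] for node in document["nodes"]}
    referenced = set()
    for edge in document["edges"]:
        referenced.add(edge["from"])
        referenced.add(edge["to"])
    return sorted(referenced - defined)
```

Note that the function needs the whole parsed document in memory, which a streaming validator with limited state could not assume.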

Context

Different applications might have different validation constraints at different stages of the application. A few examples:

Regular Expressions

Strings within JSON Schema may be required to match a regular expression. Validating a string against a regular expression is a kind of syntactic validation: the matcher may keep state, so earlier characters can change how later characters are interpreted.

This is a warranted feature in JSON Schema because there is always going to be data inside a document that can be broken apart into multiple values, but is not well represented by JSON. For example, a date has year, month, and more components. There are already good standards for encoding dates as strings, so inventing a JSON encoding for dates is not really necessary.
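
As an illustration of the date case, the sketch below uses Python's re module to check that a string has the shape YYYY-MM-DD and to break it into components. The function name split_date is hypothetical, and the expression checks only the shape of the string: a value such as "2020-13-45" would match even though it is not a real date.

```python
import re

# Four digits, two digits, two digits, separated by hyphens.
DATE_SHAPE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

def split_date(text):
    """Return (year, month, day) as integers, or None if the shape is wrong."""
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return year, month, day
```

The string stays a single JSON value in the document, while the application is free to split it into parts after validation.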

RPC

RPC calls for data update and retrieval are one case where the same resource might have different schemas applied to it at different points in its life-cycle. At creation time, the resource might exclude an ID, since it is to be assigned by the server. But for updates, an ID might be mandatory, so that the server knows which document in the collection to update. These technically call for two separate schemas.
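
The two schemas might look like the following sketch, written as Python dictionaries. The property names and the helper missing_required are assumptions for illustration; the helper evaluates only the "required" keyword and is not a JSON Schema validator.

```python
# Schema applied when a resource is created: the server assigns the ID.
CREATE_SCHEMA = {"type": "object", "required": ["name"]}

# Schema applied when a resource is updated: the ID is mandatory.
UPDATE_SCHEMA = {"type": "object", "required": ["id", "name"]}

def missing_required(schema, document):
    """Return the required property names that are absent from document."""
    return [name for name in schema["required"] if name not in document]

new_resource = {"name": "Example"}
stored_resource = {"id": 17, "name": "Example"}
```

The same new_resource document passes the creation schema and fails the update schema, even though nothing about the data itself has changed.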

Related article: https://martinfowler.com/bliki/CQRS.html

Annotations

JSON Schema can still be used to declare relationships between data, and if the document passes JSON Schema validation, applications can opt into performing additional tests based on the declared relationships in the data.

For example, a form generation library might look at a "range" property, and use it to auto-complete usernames from the database of legal names, and not permit form submission until it verifies the username actually exists.