Scope of JSON Schema Validation

Austin Wright edited this page Dec 16, 2019 · 12 revisions

This page fleshes out some of the considerations on what is in scope for JSON Schema Validation, and relationships to other features of JSON Schema.

Validation vs. linting

Across the range of documents that can be parsed, several passes over a document may be needed in order to extract information from it.

In the simplest form, a state machine consumes each character of a document. At the end of the document, the state of the parser will provide the application with the data necessary for it to do its job. If there were no errors—no illegal characters encountered—the document is considered valid.
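As a rough illustration of this idea, the sketch below runs a two-state machine over a made-up format: comma-separated unsigned integers such as "1,22,333". The format, the state names, and the function name `is_valid_number_list` are all assumptions chosen for this example; they are not part of JSON or JSON Schema.

```python
DIGITS = "0123456789"


def is_valid_number_list(text: str) -> bool:
    """Accept comma-separated unsigned integers such as "1,22,333"."""
    state = "start"  # expecting the first digit of a number
    for ch in text:
        if ch in DIGITS:
            state = "in_number"
        elif ch == "," and state == "in_number":
            state = "start"  # a comma must be followed by another number
        else:
            return False  # illegal character for the current state
    # The document is valid only if it ends inside a number.
    return state == "in_number"
```

Here `is_valid_number_list("1,22,333")` returns `True`, while `"1,,2"` and `"1,2,"` are rejected because a comma arrives, or the input ends, while the machine is still waiting for a digit.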

However, from the standpoint of the application, there may be multiple ways to encode the same data. For example, YAML can encode the boolean true value as either true or TRUE; both parse to the same value. If the application developer wishes to express a preference for one or the other, this may be checked with a linter.
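The same distinction can be seen in JSON, where insignificant whitespace gives many encodings of the same data. The following is a minimal sketch of a lint check, assuming a hypothetical house style that prefers the compact form; `lint_compact` is a name invented for this example.

```python
import json


def lint_compact(text: str) -> list[str]:
    """Return lint warnings for a document that is already valid JSON."""
    data = json.loads(text)  # raises ValueError if the text is not valid JSON
    # Re-serialize without optional whitespace (assumed house style).
    compact = json.dumps(data, separators=(",", ":"))
    if text != compact:
        return [f"expected compact form: {compact}"]
    return []
```

Both `'{"a":1}'` and `'{ "a" : 1 }'` are valid and parse to the same data, so a validator accepts both; only the linter distinguishes them.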

Types of Validation

Validation ensures that the document (including its data) is in a form suitable for consumption by the application. It can be broken down into two broad categories:

Syntactic validation ensures that the document parser can reach the end of the document fully and unambiguously able to understand its syntax.

Semantic validation ensures that the data within the document is within the boundaries that the application will understand.
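The two categories can be sketched as two separate steps. In the example below, the "age" property and its 0 to 150 range are hypothetical application rules, and `validate_age_document` is a name made up for this illustration.

```python
import json


def validate_age_document(text: str) -> str:
    """Classify a document as a syntax error, a semantic error, or valid."""
    # Syntactic validation: can the parser reach the end of the document?
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "syntax error"

    # Semantic validation: is the data within the boundaries that this
    # (hypothetical) application understands?
    age = data.get("age") if isinstance(data, dict) else None
    is_integer = isinstance(age, int) and not isinstance(age, bool)
    if is_integer and 0 <= age <= 150:
        return "valid"
    return "semantic error"
```

A truncated document such as `'{"age": 30'` fails the first step, while `'{"age": -5}'` parses cleanly and fails only the second.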

Types of Semantic Validation

Structural

I believe this term was invented by JSON Schema.

In short, structural validation is concerned with describing the outermost limits that a document must conform to, which includes allowing values that might become sensible in the future. For example, Elbonia is not a country right now, but it's feasible that a country named Elbonia could form some time in the future. Therefore, it would not make sense to hard-code a list of countries as part of structural validation.

Formally, structural validation is concerned with placing assertions on a single value. For example, "Value is greater than zero", "Value is an array", "Value is nonempty", or "Value has property with key name".
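The four example assertions above map onto the JSON Schema keywords `exclusiveMinimum`, `type`, `minItems`, and `required`. The sketch below checks only those four keywords against a single value; it is a hand-written illustration, not a conforming validator, and the function name `check_structure` is an assumption.

```python
def check_structure(value, schema: dict) -> list[str]:
    """Return the failed assertions; an empty list means the value passes."""
    errors = []
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)

    # "Value is an array" (only the "array" type is handled in this sketch)
    if schema.get("type") == "array" and not isinstance(value, list):
        errors.append("value is not an array")
    # "Value is greater than zero" (the keyword applies to numbers only)
    if "exclusiveMinimum" in schema and is_number:
        if value <= schema["exclusiveMinimum"]:
            errors.append(f"value is not greater than {schema['exclusiveMinimum']}")
    # "Value is nonempty" (the keyword applies to arrays only)
    if "minItems" in schema and isinstance(value, list):
        if len(value) < schema["minItems"]:
            errors.append(f"array has fewer than {schema['minItems']} items")
    # "Value has property with key name" (the keyword applies to objects only)
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"missing property {key!r}")
    return errors
```

Every check looks only at the value it is given. None of them needs another part of the document, a database, or the network, which is what separates structural validation from the other kinds discussed on this page.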

Consistency

Consistency validation ensures that references between data make sense. For example, "Value is an ID string found in the users database", "Value A is less than value B", or "Value is an ID number that must be distinct from other ID numbers".

There are many ways to verify data consistency, far too many to incorporate into JSON Schema. Validating data consistency may involve:

  • Scanning the rest of the document for a referenced value
  • Making a network request for a referenced document
  • Querying a database for a record with the given value
  • Sending a message to the referenced address, and verifying it was received

In short, data consistency is tested by actually exercising the intended action. For looking up a user, you must actually query the users database. For testing an email address, you must actually send an email.
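The first mechanism in the list, scanning the rest of the document for a referenced value, can be sketched as follows. The document layout (a "users" array and an "orders" array whose "userId" values refer to user IDs) is hypothetical and exists only for this example.

```python
def find_dangling_user_refs(document: dict) -> list[str]:
    """Return every order's userId that matches no user in the same document."""
    # First pass: collect the IDs that actually exist.
    known_ids = {user["id"] for user in document.get("users", [])}
    # Second pass: report references that point at nothing.
    return [
        order["userId"]
        for order in document.get("orders", [])
        if order["userId"] not in known_ids
    ]
```

The check needs two values from different places in the document, so it cannot be written as an assertion on any single value. The other mechanisms in the list follow the same pattern, with a network request or database query in place of the first pass.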

Inference

The primary job of an inference engine is to determine additional data about a resource based on what is known. Inference typically has some way to perform validation, which checks that there are no contradictions between pieces of data. For example, a resource cannot be both a Cat and a Dog.

This is the type of validation used by RDF Schema.
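A toy sketch of the two halves of this idea, inferring additional data and then checking for contradictions, might look like the following. The `SUBCLASS_OF` and `DISJOINT` tables and both function names are hypothetical; this is plain Python, not an RDF API.

```python
# Hypothetical knowledge base: every Cat and every Dog is an Animal,
# and nothing can be both a Cat and a Dog.
SUBCLASS_OF = {"Cat": "Animal", "Dog": "Animal"}
DISJOINT = [("Cat", "Dog")]


def infer_types(types: set[str]) -> set[str]:
    """Add every superclass implied by the stated types."""
    inferred = set(types)
    pending = list(types)
    while pending:
        parent = SUBCLASS_OF.get(pending.pop())
        if parent is not None and parent not in inferred:
            inferred.add(parent)
            pending.append(parent)
    return inferred


def contradictions(types: set[str]) -> list[tuple[str, str]]:
    """Return each disjoint pair that the resource belongs to simultaneously."""
    inferred = infer_types(types)
    return [(a, b) for a, b in DISJOINT if a in inferred and b in inferred]
```

A resource typed only as Cat is inferred to be an Animal as well and has no contradictions; a resource typed as both Cat and Dog yields the pair `("Cat", "Dog")`.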

Context

Different applications might have different validation constraints at different stages of the application. A few examples:

Regular Expressions

A JSON Schema may require a string to match a regular expression. Validating a string against a regular expression is a kind of syntactic validation: the matcher may keep state, so earlier characters can change how later characters are interpreted.

This is a warranted feature in JSON Schema because there will always be data inside a document that can be broken apart into multiple values but is not well represented by JSON. For example, a date has year, month, and other components. There are already good standards for encoding dates as strings, so inventing a JSON encoding for dates is not really necessary.
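As an illustration, the sketch below uses a regular expression both to check a date string and to break it into components. The pattern and the name `split_date` are assumptions for this example. It uses Python's `re` module, whose dialect differs in some details from the ECMA 262 dialect that JSON Schema recommends for the `pattern` keyword.

```python
import re

# JSON Schema patterns are not implicitly anchored, so the anchors are
# written out explicitly and re.search is used rather than re.fullmatch.
DATE_PATTERN = r"^([0-9]{4})-([0-9]{2})-([0-9]{2})$"


def split_date(value: str):
    """Return (year, month, day) if the string looks like YYYY-MM-DD, else None."""
    match = re.search(DATE_PATTERN, value)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return year, month, day
```

The check is purely syntactic: `split_date("2019-02-30")` still returns `(2019, 2, 30)`, because deciding whether that day exists is not something this pattern expresses.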

RPC

RPC requests for data update and retrieval are one case where the same resource might have different schemas applied to it at different points in its lifecycle. At creation time, the resource might exclude an ID, since it is to be assigned by the server. But for updates, an ID might be mandatory, so the server knows which document in the collection to update. These technically call for two separate schemas.

Related article: https://martinfowler.com/bliki/CQRS.html
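The two schemas described in this section might look like the pair below. This is a sketch under stated assumptions: the "name" property and the schema contents are hypothetical, and `satisfies` is a tiny helper that understands only the `required` and `not` keywords, not a full validator.

```python
# At creation time the client must not send an ID; the server assigns it.
CREATE_SCHEMA = {"required": ["name"], "not": {"required": ["id"]}}
# For updates the ID is mandatory, so the server knows what to update.
UPDATE_SCHEMA = {"required": ["id", "name"]}


def satisfies(resource: dict, schema: dict) -> bool:
    """Check a resource against the "required" and "not" keywords only."""
    has_required = all(key in resource for key in schema.get("required", []))
    negated = schema.get("not")
    if negated is not None and satisfies(resource, negated):
        return False  # the resource matches a schema it must not match
    return has_required
```

The same resource, `{"id": 1, "name": "x"}`, is rejected by `CREATE_SCHEMA` and accepted by `UPDATE_SCHEMA`, so which schema applies depends on the stage of the lifecycle rather than on the data alone.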

Annotations

JSON Schema can still be used to declare relationships between data. Verifying that the data is consistent falls outside the scope of a typical validator and is left to applications, which must look for and process the declared data relationships.

For example, a form generation library might look at a "range" property, and use it to auto-complete usernames from the database of legal names.
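A minimal sketch of that form-generation scenario follows. The "range" keyword is treated here as a custom annotation, not a standard JSON Schema keyword, and the schema, the in-memory `DATABASE`, and the `autocomplete` function are all hypothetical stand-ins.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        # "range" is a hypothetical annotation naming the collection
        # that this value refers to.
        "owner": {"type": "string", "range": "users"},
        "title": {"type": "string"},
    },
}

# Stand-in for a real database of legal names.
DATABASE = {"users": ["alice", "alan", "bob"]}


def autocomplete(schema: dict, field: str, prefix: str, database: dict) -> list[str]:
    """Suggest values for a field, using its "range" annotation if present."""
    target = schema["properties"][field].get("range")
    if target is None:
        return []  # no declared relationship, so nothing to suggest
    return [name for name in database[target] if name.startswith(prefix)]
```

A validator would ignore "range" entirely; it is the application that reads the annotation and decides what to do with it, here offering "alice" and "alan" for the prefix "al".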