Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(contrib): Start guidelines for schema design #15037

Merged
merged 1 commit into from
Jan 9, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions src/doc/contrib/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
- [Architecture](./implementation/architecture.md)
- [New packages](./implementation/packages.md)
- [New subcommands](./implementation/subcommands.md)
- [Data Schemas](./implementation/schemas.md)
- [Console Output](./implementation/console.md)
- [Filesystem](./implementation/filesystem.md)
- [Formatting](./implementation/formatting.md)
Expand Down
47 changes: 47 additions & 0 deletions src/doc/contrib/src/implementation/schemas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Data Schemas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have an overview of what this "data schemas" refer to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some context


Cargo reads and writes user and machine facing data formats, like
- `Cargo.toml`, read and written on `cargo package`
- `Cargo.lock`, read and written
- `.cargo/config.toml`, read-only
- `cargo metadata` output
- `cargo build --message-format` output

## Schema Design

Generally,
- Fields should be kebab case
- `#[serde(rename_all = "kebab-case")]` should be applied defensively
- Fields should only be present when needed, saving space and parse time
- Also, we can always switch to always outputting the fields but its harder to stop outputting them
- `#[serde(skip_serializing_if = "Default::default")]` should be applied liberally
- For output, prefer [jsonlines](https://jsonlines.org/) as it allows streaming output and flexibility to mix content (e.g. adding diagnostics to output that didn't prevously have it
- `#[serde(deny_unknown_fields)]` should not be used to allow evolution of formats, including feature gating

## Schema Evolution Strategies
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reading this, each bullet point is a bit disconnected to me.
I can't tell what it means exactly by evolution strategies. Maybe it is a pretty common term but I didn't know?

Also like "When reading data, add new fields", I had a hard time interpreting it as part of evolution strategies. For "Versioned with the edition" I am able to guess it means "When a schema needs to change some existing field, versioning it with edition", though the whole paragraph still looks pretty sparse and needs some guesses for me to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to expand on it a bit more. Hope it helps!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That is good enough for me as a maintainer. Not sure about casual contributors, but we can always iterate later :)


When changing a schema for data that is read, some options include:
- Adding new fields is relatively safe
- If the field must not be ignored when present,
have a transition period where it is invalid to use on stable Cargo before stabilizing it or
error if its used before supported within the schema version
(e.g. `edition` requires a minimum `package.rust-version`, if present)
- Adding new values to a field is relatively safe
- Unstable values should fail on stable Cargo
- Version the structure and interpretation of the data (e.g. the `edition` field or `package.resolver` which has an `edition` fallback)

Note: some formats that are read are also written back out
(e.g. `cargo package` generating a `Cargo.toml` file)
and those strategies need to be considered as well.

When changing a schema for data that is written, some options include:
- Add new fields if the presence can be ignored
- Infer permission from the users use of the new schema (e.g. a new alias for an `enum` variant)
- Version the structure and interpretation of the format
- Defaulting to the latest version with a warning that behavior may change (e.g. `cargo metadata --format-version`, `edition` in cargo script)
- Defaulting to the first version, eventually warning the user of the implicit stale behavior (e.g. `package.edition` in `Cargo.toml`)
- Without a default (e.g. `package.rust-version`, or a command-line flag like `--format-version`)

Note: While `serde` makes it easy to support data formats that add new fields,
new data types or supported values for a field are more difficult to future-proof
against.
Loading