Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to get canonical JSON output #1037

Open
digulla opened this issue May 26, 2023 · 15 comments
Open

Add support to get canonical JSON output #1037

digulla opened this issue May 26, 2023 · 15 comments

Comments

@digulla
Copy link
Contributor

digulla commented May 26, 2023

I often need to compare the output of some REST API endpoint against a "known correct result" file.

This allows me to share the file with business, QAs and customers to get early feedback. It also helps them to keep track of changes since they get to see the current result without having to deploy and run the application.

For this to be efficient, I need a way to update the file when the output changes.

Right now, I can use JsonNode.equals() to check a whole tree for any changes but that doesn't tell me exactly where and what those changes are.

I can create an ObjectMapper which sorts (most) keys and which serializes BigDecimal consistently but there are still several corner cases:

  • Type information gets serialized as the first field. This order is lost when parsing the JSON file again.
  • IDEs like Intelli/J love to change the formatting of the JSON and it's hard to replicate those rules with Jackson. So I'm reading the file, parse it into JSON and format it again before I can compare it. Main pain point here: The space before the colon in an object.
  • Getting the correct Jackson setup takes a dozen lines of code.

I think the ideal solution would be a factory method to build a CanonicalJsonGenerator. It must

  • Sort all keys in the JsonNode tree
  • Strip trailing zeros from BigDecimal and print them using toPlainString()
  • Enable pretty printing
  • Use LF as line separator

Optionally, it would be great to have a DefaultPrettyPrinter where _objectFieldValueSeparatorWithSpaces is just colon-space instead of space-colon-space. Maybe Separators should be extended to handle this case there.

I also like to convert date&time to ISO to make it easier to understand the values.

My current setup for reference is below. With this setup, I have to do this before I can compare the JSON strings:

  • Serialize my Java objects
  • Parse the output into JsonNode
  • Serialize the JsonNode again to fix the order of type fields created by JsonTypeInfo
class DefaultPrettyPrinterForTests : DefaultPrettyPrinter() {
    override fun withSeparators(separators: Separators): DefaultPrettyPrinter {
        _separators = separators
        _objectFieldValueSeparatorWithSpaces = separators.objectFieldValueSeparator + " "
        return this
    }
}

private static DefaultIndenter STABLE_INDENTER = new DefaultIndenter("    ", "\n")
private static DefaultPrettyPrinter PRETTY_PRINTER = new DefaultPrettyPrinterForTests()
            .withObjectIndenter(STABLE_INDENTER)

public static ObjectMapper forTests() {
    return Jackson2ObjectMapperBuilder()
        .featuresToEnable(
            /* Avoid 1.23445678E8 instead of 123445678 in the output */
            JsonGenerator.Feature.WRITE_BIGDECIMAL_AS_PLAIN,
            SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS,
            MapperFeature.SORT_PROPERTIES_ALPHABETICALLY
        )
        .featuresToDisable(
            /** Serialize all dates and times in ISO format */
            SerializationFeature.WRITE_DURATIONS_AS_TIMESTAMPS,
            SerializationFeature.WRITE_DATES_AS_TIMESTAMPS,
            SerializationFeature.WRITE_DATE_KEYS_AS_TIMESTAMPS,
            SerializationFeature.WRITE_DATES_WITH_CONTEXT_TIME_ZONE
        )
        .indentOutput(true)
        .postConfigurer( mapper ->
            mapper.setDefaultPrettyPrinter(PRETTY_PRINTER)
        )
        .build<ObjectMapper>()
}
@cowtowncoder
Copy link
Member

cowtowncoder commented May 26, 2023

First of all, yes, there is need for some sort of Canonical output.
Possibly (ideally?) using a well-defined, existing specification that outlines requirements -- I have vague recollection such exists (possible more than one, would need to choose).

I don't think this should be done at JsonGenerator level as its responsibility should be simple (?) encoding of token streams. It does not buffer anything (for JSON anyway).
Rather, something feeding it should do this: probably using Tree Model (JsonNode).
Could be either via ObjectMapper or add-on, no strong opinion.

Since ObjectNode does preserve ordering, could be something that simple transforms a JsonNode into canonical equivalent (creating new instances).

This could also be an extension module, possibly providing either new type(s), or simply overriding default serializer of JsonNode (etc) with custom one that transforms instances on the fly.

@digulla
Copy link
Contributor Author

digulla commented May 30, 2023

Re canonical JSON: I found this: https://gibson042.github.io/canonicaljson-spec/
Not sure how relevant it is.

I think this functionality doesn't belong inside of ObjectMapper since it's not related to Java <-> JSON mapping. It's a pure JSON functionality. Which is why I thought to return a new generator from JsonFactory since this isn't about JSON or JsonNode but about how the data is rendered.

So it's on top of JsonGenerator but below ObjectMapper (or to think of it in another way: It's useful without data binding).

I agree the sorting of the keys can be done outside of the generator in a preparation step. But the rest (handling of trailing zeros, pretty printing) is already handled in JsonGenerator.

As a developer, I would prefer something like JsonFactory.enable(CANONICAL_JSON).createGenerator(...) which then Just Works(TM). The goal here is to have a simple setup so when you have to change something because of bug reports, developers don't have to change their code. Also, this is the place where I would start looking for such a feature. From my point of view, this feels most natural.

If this is implemented this as a JsonGeneratorDelegate, developers can then add filtering and all the other good stuff.

The other option is a special CanonicalJsonFactory but I'm worried how much code duplication that would mean and how many other APIs would need changes.

Maybe a simple solution for the sorting: Let JsonNodeFactory build new children maps. That would allow to return a TreeMap or some custom implementation. Call the factory method in the constructor of ObjectNode. That would allow to create sorted copies by calling root.deepCopy().

If deepCopy() would delegate more to JsonNodeFactory, this would allow to strip trailing zeros for BigDecimals (unless you want to add a feature flag for this). Alternatively, deepCopy() could accept a NodePostProcessor but I'm not sure how well that would fit into your design.

All that's then missing is a CanonicalJsonPrettyPrinter.

@cowtowncoder
Copy link
Member

Well, from my perspective it certainly does not belong to Streaming API as it cannot be implemented in streaming manner.

But I disagree on it not fitting in jackson-databind, in the sense that JsonNode is already included there -- and that is not a binding really, but a JSON-centric object representation.
One could definitely argue if that should belong to either in jackson-core (loosening Streaming aspect) -- something many users would like, fwtw -- or in a new package.
But from purely practical perspective JsonNode won't be moving into jackson-core; and the most likely place for canonical handler would either be in jackson-databind or as a separate "module" (whether proper Jackson Module or just general "module") and repo.

So: to make it simple -- I will not add canonical JSON handling at level of JsonParser/JsonGenerator/JsonFactory. But I'd be happy to help support another approach to providing support for Canonical JSON.

Another thing to consider, too, is whether Canonical handling would be useful for other dataformats. I would assume that for some (binary JSON ones, CBOR and Smile) it would make sense; and for most others (CSV, XML, Protobuf etc) not.

@digulla
Copy link
Contributor Author

digulla commented May 30, 2023

Ah, now I get where you're at. My ideas were purely from a developer perspective. I was assuming JsonNode is part of core, so I never checked. I don't want to move JsonNode either, I'm looking for the cheapest solution (least amount of new code) :-)

What do you think about my proposal to move more code from ObjectNode to JsonNodeFactory so sorting and BigDecimal processing can be done with the help of a special JsonNodeFactory? There are already seveal comments in the code for ObjectNode to this effect.

That might have the additional benefit that it might be useful for other use cases.

The other option would be a tree visitor which then delegates to a JsonGenerator. That would use less memory and it would be very clean solution. I feel that it would be a lot of new code, though.

@cowtowncoder
Copy link
Member

Right, if JsonNode did exist in jackson-core things would be bit different.

Unfortunately no, JsonNodeFactory is purely for constructing "empty" instances; it is not designed to have any logic. Partly the issue is that it doesn't take in any configuration (which is probably and oversight).
But more fundamentally I guess JsonNodeFactory is used on deserialization, not serialization.

One could probably register a custom JsonSerializer for JsonNode (and/or ObjectNode), but serialization of JsonNode is handled by methods in JsonNode and subtypes that implement JsonSerializable. So default logic (which is quite simple TBH) is not really modular. Or at least less so than deserialization.

Delegates for JsonGenerator sound similarly like non-optimal as they'd need to re-construct ordering, adding buffering; it seems like backwards approach as well (although to users would be modular; just tricky and inefficient to implement).

I may be missing some other approaches.

@digulla
Copy link
Contributor Author

digulla commented Jun 4, 2023

This sounds like the best option is then to preconfigure MapperBuilder.

Right now, MapperBuilder isn't opinionated. Would you prefer a

  1. new method in there,
  2. an extension class CanonicalMapperBuilder
  3. a configurer (which takes an existing builder and applies presets)

The first one will add a lot of code in the existing class which doesn't really fit into the existing desing.

The second one isn't very flexible since you have to change your existing code to use the new type.

The last one is very flexible, isolates concerns but (as far as I can tell) would add a new concept.

@digulla
Copy link
Contributor Author

digulla commented Jun 4, 2023

I've started with unit tests and a bit of code. It's more ugly than I hoped. You can have a look here: FasterXML/jackson-databind#3963

I left TODOs in all places where something was unclear or I think the solution could be better.

One other thing that I just noticed: The pipeline tried to deploy the Java 8 build!
See https://github.com/digulla/jackson-databind/actions/runs/5168255885/jobs/9309721753

I hope this happens in my repo and not in the one from Jackson. Can I turn this off or should I just ignore it?

@cowtowncoder
Copy link
Member

@digulla Ok on deploy... that is weird. It should filter out PRs and not try deploy. Feel free to ignore it, it's not your fault; main pipeline does need to do better job.

@cowtowncoder
Copy link
Member

cowtowncoder commented Jul 26, 2023

@digulla One quick not on canonical JSON (from https://gibson042.github.io/canonicaljson-spec/) vs changes so far.
It seems to me that:

MUST NOT include insignificant (i.e., inter-token) whitespace 

would work without Pretty Printer, and the additional configuration that adds single white-space after : in Objects would make output not-Canonical wrt spec?
(but this is fine for purposes of "a" canonical representation purely for comparison purposes).

I suppose similarly the requirement to always force scientific notation (Spec) is different from your preferences.

This leads to something I realized: maybe Canonicalizer (however implemented) should have some configurability wrt specific commonly desired aspects?

@digulla
Copy link
Contributor Author

digulla commented Aug 8, 2023

I think there will be two main use cases:

  • Creating stable JSON output, for example to get stable hashes, compression results or builds.
  • Format JSON to compare two complex objects in a unit test

For the former, we want a compact format. For the latter, we want pretty printing.

Ideally, there should be a builder which presets everything for the two cases and then configuration methods or support configurers so developers can tweak the standard.

For example, I would really like to be able to configure the number formatter. There are several useful ways to present numbers:

  • Always exponential form: 1, 1.23E2, 1.45E-2
  • Always plain form: 1, 123, 0.0145
  • Exponential without decimal places: 1, 123, 145E-3
  • A binary IEEE format

The first one is required by the standard. I'm not sure how useful that format is in real life. The format is also prone to rounding errors while parsing.

The second form is easy to understand for human and it's also prone to rounding errors during parsing.

The third form is a compromise between readability and keeping rounding errors in check since you can first read the number as (big)integer and then just multiply by a power of 10.

The last form represents the number perfectly, is very fast to parse but maybe not portable between, say, Java and JavaScript.

That's why I'm trying to add a generic API to format&parse numbers to Jackson.

@cowtowncoder
Copy link
Member

Yeah I don't think there will ever be generic API for number formatting, since that inevitably would result in additional overhead and complexity.

Worse, since conversions between binary (float, double) and decimal (BigDecimal, textual JSON) representations are lossy, there are also problems with accuracy if attempting to unify handling.

If limiting canonical handling to be accessed using JsonNode things are slightly simpler as we can at least force use of BigDecimal and then probably force scientific/plain output.

Then again, if only using JsonNode it seems odd having to configure ObjectMapper.

I guess I am still not quite sure what a workable solution should look like.

@gabrieljones
Copy link

@gabrieljones
Copy link

gabrieljones commented Jan 9, 2025

Possibly (ideally?) using a well-defined, existing specification that outlines requirements -- I have vague recollection such exists (possible more than one, would need to choose).

RFC 8785 was mentioned here
Which is also here

@gabrieljones
Copy link

@cowtowncoder
Copy link
Member

@gabrieljones Yes, there are ways to get more canonical output, thank you for the link!

On the hand, this is not quite all there is to canonicalization, I think (yes, I wrote the blog post). But would be good to know missing pieces and especially if there is actual demand for more work/features. Esp. since some of the things may require low-level changes to things Unicode encoding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants