Add support for deserializing list-encoded JSON structs [#6558] #6643

jagill · 2024-10-29T15:17:54Z

Which issue does this PR close?

Closes #6558.

Rationale for this change

Currently, a StructArray can only be deserialized from a JSON object (e.g. {a: 1, b: "c"}), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g. [1, "c"]) because this is more compact, and the schema is known.
Arrow-json cannot currently deserialize these.

What changes are included in this PR?

This PR adds the ability to parse JSON lists into StructArrays, if the StructParseMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior.

Are there any user-facing changes?

Users may set the StructParsingMode enum to ListOnly to parse list-style structs. The associated enum,
variants, and method have been documented. I'm happy to update any other documentation.

Discussion topics

I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would true be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.
I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.
I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.
One requirement was that benchmarks do not regress. My running of benchmarks have been inconclusive (see https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are often small regressions or improvements in the single-digit % range whenever I switch between master and this PR. I suspect they are statistical but I wanted to note these.

tustvold · 2024-11-08T14:59:37Z

Sorry this is on my list to look into, but I've struggled to find time to look into it yet... I appreciate your patience and restraint from repeatedly tagging me (unlike some people are wont to do).

jagill · 2024-12-05T19:35:48Z

Are there any changes I could make to this PR to make it easier to review?

alamb

Thank you for this contribution @jagill and i very much apologize for the delay in reviewing -- we are always looking for more help:

Looking for help maintaining (reviewing PRs) in arrow/parquet/object_store #6418

TLDR thoughts:

Are you planning / willing to implement serializing support too (aka support in json_writer)? That would then allow us to verify that data can be round-tripped
I believe this PR slows down performance (notes below)

API

I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would true be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.

I agree that a struct is nice for the reasons you mention

Errors

I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.

I think improved error messages are always better, but we can do them as a follow on PR

I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.

💯

Performance

One requirement was that benchmarks do not regress. My running of benchmarks have been inconclusive (see https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are often small regressions or improvements in the single-digit % range whenever I switch between master and this PR. I suspect they are statistical but I wanted to note these.

I am not sure what benchmarks you ran -- I think the most relevant ones are in arrow/src/json_reader.rs

I ran them like cargo bench --bench json_reader

Here are three back to back runs I did on my test bench:

++ critcmp master json-struct-from-list
group                    json-struct-from-list                  master
-----                    ---------------------                  ------
large_bench_primitive    1.17      4.2±0.05ms        ? ?/sec    1.00      3.6±0.01ms        ? ?/sec
small_bench_list         1.03     14.5±0.05µs        ? ?/sec    1.00     14.1±0.17µs        ? ?/sec
small_bench_primitive    1.00      7.2±0.04µs        ? ?/sec    1.00      7.2±0.02µs        ? ?/sec

group                    json-struct-from-list                  master
-----                    ---------------------                  ------
large_bench_primitive    1.16      4.2±0.03ms        ? ?/sec    1.00      3.6±0.01ms        ? ?/sec
small_bench_list         1.03     14.5±0.05µs        ? ?/sec    1.00     14.0±0.13µs        ? ?/sec
small_bench_primitive    1.00      7.2±0.03µs        ? ?/sec    1.01      7.2±0.03µs        ? ?/sec

++ critcmp master json-struct-from-list
group                    json-struct-from-list                  master
-----                    ---------------------                  ------
large_bench_primitive    1.15      4.2±0.01ms        ? ?/sec    1.00      3.7±0.01ms        ? ?/sec
small_bench_list         1.00     14.4±0.10µs        ? ?/sec    1.00     14.4±0.08µs        ? ?/sec
small_bench_primitive    1.00      7.1±0.03µs        ? ?/sec    1.06      7.6±0.05µs        ? ?/sec
++ popd
~/datafusion-benchmarking

It shows a pretty consistent slowdown in large_bench_primitive -- I left some suggestions on how to improve this

Again, sorry for the delay in reviewing

alamb · 2024-12-17T19:59:19Z

arrow-json/src/reader/struct_array.rs

@@ -71,42 +75,70 @@ impl ArrayDecoder for StructArrayDecoder {
            .then(|| BooleanBufferBuilder::new(pos.len()));

        for (row, p) in pos.iter().enumerate() {


Performance wise, this code now has another conditional in the hot loop (each row)

for (row, p) in pos.iter().enumerate() { if self.struct_parse_mode { .. } }

I suspect the code would be fast (and easier to understand / cleaner) if you hoisted the check, something like

if self.struct_parse_mode { for (row, p) in pos.iter().enumerate() { .. } else { for (row, p) in pos.iter().enumerate() { .. } }

alamb · 2024-12-17T20:01:18Z

arrow-json/src/reader/mod.rs

@@ -170,12 +170,30 @@ mod struct_array;
 mod tape;
 mod timestamp_array;

+/// Specifies what is considered valid JSON when parsing StructArrays.


Do you have a link to documentation / prior art for this type of JSON encoding of structs? Maybe to trino / presto docs that describe it in more full detail that we can add to this doc?

Sadly, the documentation of the format of the returned data is a little lacking. They describe that the top-level row is a list with one entry per column (https://prestodb.io/docs/current/develop/client-protocol.html#important-queryresults-attributes, https://trino.io/docs/current/develop/client-protocol.html#important-queryresults-attributes), but do not specify how any structural elements are serialized (or non-structural things like Datetimes etc). If it helps, I could add this to the Presto documentation.

jagill · 2024-12-18T16:47:54Z

Ok, I've hoisted the conditional outside the loop. It also had the effect (as you predicted) that the gnarly match statement for the delimiters is a bit easier to understand.

Are you planning / willing to implement serializing support too (aka support in json_writer)? That would then allow us to verify that data can be round-tripped

I was not planning to, because there was no use-case requiring it. But I'm certainly willing to, both for symmetry and for round-tripping. Is that something you'd want in this PR, or a follow-up?

I believe this PR slows down performance (notes below)

When I check, I'm consistently seeing no difference between the main branch and either of the variants:

❯ critcmp main inner-conditional
group                    inner-conditional                      main
-----                    -----------------                      ----
large_bench_primitive    1.00      2.7±0.01ms        ? ?/sec    1.00      2.7±0.03ms        ? ?/sec
small_bench_list         1.01      8.4±0.09µs        ? ?/sec    1.00      8.2±0.06µs        ? ?/sec
small_bench_primitive    1.01      4.6±0.03µs        ? ?/sec    1.00      4.6±0.04µs        ? ?/sec

❯ critcmp main outer-conditional
group                    outer-conditional                      main
-----                    -----------------                      ----
large_bench_primitive    1.00      2.7±0.03ms        ? ?/sec    1.00      2.7±0.03ms        ? ?/sec
small_bench_list         1.00      8.3±0.11µs        ? ?/sec    1.00      8.2±0.06µs        ? ?/sec
small_bench_primitive    1.00      4.6±0.05µs        ? ?/sec    1.00      4.6±0.04µs        ? ?/sec

❯ critcmp inner-conditional outer-conditional
group                    outer-conditional                      inner-conditional
-----                    -----------------                      -----------------
large_bench_primitive    1.00      2.7±0.03ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
small_bench_list         1.00      8.3±0.11µs        ? ?/sec    1.01      8.4±0.09µs        ? ?/sec
small_bench_primitive    1.00      4.6±0.05µs        ? ?/sec    1.01      4.6±0.03µs        ? ?/sec

(btw, I didn't know about critcmp; thanks for the pointer!)

However, I'm not running in an isolated environment, so the differences I saw in the past seemed to be related to other processes. I'm running on a M2 Mac and haven't tried on with x86 or other architectures, and haven't been able to isolate from the helpful random processes that Macos spins up, so I'll trust your benchmarks.

alamb · 2024-12-24T14:57:09Z

Thank you very much again @jagill for this contribution.

Sorry for the delay in reviewing / that this PR is somewhat stuck

From my perspective having partial support for a feature (list encoded json structs) is non ideal and will increase the maintenance burden if we are not careful. I think it runs the risk of being a little used / unknown feature.

So in order to move this forward I think we need:

Additional documentation explaining the feature along with its rationale and an example. Perhaps as an example on https://docs.rs/arrow-json/54.0.0/arrow_json/reader/struct.ReaderBuilder.html -- for rationale you could repurpose some of the comments / dicussion in this PR linked to the trino docs, etc
File a ticket (maybe even prepare a PR) to track the work to add writing this version of JSON

jagill · 2024-12-30T01:17:12Z

@alamb: With this update I've

included the hoisting of the conditional above the loop, which I missed last time due to atrophied git add muscles,
renamed and extracted StructParseMode to a top level StructMode in arrow_json, since I've...
enabled writing structs to JSON lists in the writer,
included some more examples in the with_struct_mode methods of ReaderBuilder and WriterBuilder, along with links to the scanty Presto/Trino documentation,
added some round-trip tests in arrow-json/lib.rs, which seem lonely because I can't find others but I didn't see a better place to put them.

One thing that makes me sad is that the Writer has an explicit_nulls field which is not orthogonal to StructMode. In particular, writing to a JSON List must include nulls, so explicit_nulls == false and StructMode::ListOnly is not a consistent state. I saw three main approaches here:

Split StructMode::ObjectOnly to ObjectOnlyWithImplictNulls and ObjectOnlyWithExplicitNulls (but with better names). This means only valid states will exist, and the Reader can choose between accepting implicit nulls or not. But this is a breaking change.
Error/panic in WriterBuilder if explicit_nulls == false and struct_mode == ListOnly. This prevents users from using a non-consistent config, but adds failure paths.
Only skip nulls if explicit_nulls == false and struct_mode == ObjectOnly, and add that to the documentation for with_explicit_nulls and with_struct_mode. But this could surprise users who don't read the documentation.

I chose 3 but in theory I'm open to any of these (or other suggestions). In practice, I think that 1 would involve too much yak shaving, and we can always do the breaking change combining the two options later.

I'm also happy to add more documentation for rational or examples, with the following caveats:

I wanted to add a doctest but constructing an expected StructArray was so verbose it obfuscated the purpose of the doctest. Maybe there's a better way than I know?
Getting the doc update PR accepted/merged into Presto and Trino will take an unknown amount of time. Maybe a week, maybe 3 months. I'm reticent to block this update for that.

alamb · 2024-12-30T11:07:40Z

Awesoe -- thank you @jagill - I will try and review this PR later today or tomorrow

arrow-json/src/lib.rs

arrow-json/src/reader/mod.rs

arrow-json/src/writer/mod.rs

arrow-json/src/writer/encoder.rs

arrow-json/src/reader/struct_array.rs

arrow-json/src/reader/mod.rs

jagill · 2025-01-09T23:21:41Z

@Jefffrey I addressed your comments; thanks for the feedback!

Jefffrey

Sorry for the delay.

Generally looks good to me; my main concern is how introducing this new feature couples other builder options together (i.e. struct_mode with explicit_nulls and strict_mode). I guess having this documented is the best approach for now considering we don't want to break API compatibility until next major release.

arrow-json/src/lib.rs

arrow-json/src/reader/mod.rs

Jefffrey · 2025-01-19T09:30:16Z

arrow-json/src/reader/struct_array.rs

-                (TapeElement::Null, Some(nulls)) => {
-                    nulls.append(false);
-                    continue;
+        match self.struct_mode {


Maybe add a note to explain why this duplication occurs, e.g.

// We avoid having the match on self.struct_mode inside the hot loop for performance

And then could add a TODO to investigate how to reduce the duplication if possible.

Re: wrong type decoding: I wrote them but left them sloppily in test_struct_decoding_list_length. I've extracted them and extracted most of the logic of the parsing closure into a function.

Re: option overlapping: Yeah, without a breaking API change, I see three options:

Describe that strict_mode and explicit_nulls are effectively determined by struct_mode in documentation. (As is)

Document that the "wrong" settings for strict_mode/explicit_nulls and struct_mode is not allowed; panic in that case.

Explicitly set strict_mode/explicit_nulls when struct_mode is set, and checkl when struct_mode/explicit_nulls is set. And document.

I think 3 is weird and probably leaves a footgun for later developers. I don't have a strong opinion between either 1 or 2.

Jefffrey

Generally looks good to me. We'll see if can get another opinion on the API design (possibly @alamb or @tustvold?)

Re: option overlapping: Yeah, without a breaking API change, I see three options:

Describe that strict_mode and explicit_nulls are effectively determined by struct_mode in documentation. (As is)

Document that the "wrong" settings for strict_mode/explicit_nulls and struct_mode is not allowed; panic in that case.

Explicitly set strict_mode/explicit_nulls when struct_mode is set, and checkl when struct_mode/explicit_nulls is set. And document.

I think 3 is weird and probably leaves a footgun for later developers. I don't have a strong opinion between either 1 or 2.

I also lean towards option 1 (current state of the PR); for future we could consider patterns like type-state builders when not constrained by not having breaking changes to the API.

arrow-json/src/reader/mod.rs

Currently, a StructArray can only be deserialized from or serialized to a JSON object (e.g. `{a: 1, b: "c"}`), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g. `[1, "c"]`) because this is more compact, and the schema is known. This PR adds the ability to encode and decode JSON lists from and to StructArrays, if StructMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior. Some notes/questions/points for discussion: 1. I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would `true` be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone. 2. I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route. 3. I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions. Fixes apache#6558

Jefffrey · 2025-01-26T22:35:36Z

Thanks @jagill & @alamb

Jefffrey · 2025-01-26T22:55:49Z

Generally looks good to me. We'll see if can get another opinion on the API design (possibly @alamb or @tustvold?)

Re: option overlapping: Yeah, without a breaking API change, I see three options:

Describe that strict_mode and explicit_nulls are effectively determined by struct_mode in documentation. (As is)

Document that the "wrong" settings for strict_mode/explicit_nulls and struct_mode is not allowed; panic in that case.

Explicitly set strict_mode/explicit_nulls when struct_mode is set, and checkl when struct_mode/explicit_nulls is set. And document.

I think 3 is weird and probably leaves a footgun for later developers. I don't have a strong opinion between either 1 or 2.

I also lean towards option 1 (current state of the PR); for future we could consider patterns like type-state builders when not constrained by not having breaking changes to the API.

Raised #7025 for future reference

github-actions bot added the arrow Changes to the arrow crate label Oct 29, 2024

jagill force-pushed the json-struct-from-list branch from 8fc7592 to 7181f92 Compare October 29, 2024 15:20

jagill mentioned this pull request Oct 30, 2024

Allow JSON deserialization of StructArray from JSON List #6558

Closed

alamb reviewed Dec 17, 2024

View reviewed changes

jagill force-pushed the json-struct-from-list branch from 7181f92 to c3b5fc9 Compare December 18, 2024 16:24

jagill force-pushed the json-struct-from-list branch 2 times, most recently from 5449cba to 38d3d88 Compare December 30, 2024 01:13

Jefffrey reviewed Jan 7, 2025

View reviewed changes

jagill force-pushed the json-struct-from-list branch from 38d3d88 to 3adabed Compare January 9, 2025 23:21

jagill force-pushed the json-struct-from-list branch from 3adabed to e6e4031 Compare January 11, 2025 08:43

Jefffrey reviewed Jan 19, 2025

View reviewed changes

jagill force-pushed the json-struct-from-list branch from e6e4031 to 72cb314 Compare January 20, 2025 18:32

Jefffrey approved these changes Jan 21, 2025

View reviewed changes

arrow-json/src/reader/mod.rs Outdated Show resolved Hide resolved

jagill force-pushed the json-struct-from-list branch from 72cb314 to 9de27ba Compare January 21, 2025 15:43

Jefffrey merged commit 000353d into apache:main Jan 26, 2025
23 checks passed

Jefffrey mentioned this pull request Jan 26, 2025

Enhance arrow-json reader/writer builder pattern to enforce constraints #7025

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for deserializing list-encoded JSON structs [#6558] #6643

Add support for deserializing list-encoded JSON structs [#6558] #6643

jagill commented Oct 29, 2024

tustvold commented Nov 8, 2024 •

edited

Loading

jagill commented Dec 5, 2024 •

edited

Loading

alamb left a comment

alamb Dec 17, 2024

alamb Dec 17, 2024

jagill Dec 18, 2024

jagill commented Dec 18, 2024

alamb commented Dec 24, 2024

jagill commented Dec 30, 2024 •

edited

Loading

alamb commented Dec 30, 2024

jagill commented Jan 9, 2025

Jefffrey left a comment

Jefffrey Jan 19, 2025

jagill Jan 20, 2025

Jefffrey left a comment

Jefffrey commented Jan 26, 2025

Jefffrey commented Jan 26, 2025

		@@ -71,42 +75,70 @@ impl ArrayDecoder for StructArrayDecoder {
		.then(\|\| BooleanBufferBuilder::new(pos.len()));

		for (row, p) in pos.iter().enumerate() {

Add support for deserializing list-encoded JSON structs [#6558] #6643

Add support for deserializing list-encoded JSON structs [#6558] #6643

Conversation

jagill commented Oct 29, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Discussion topics

tustvold commented Nov 8, 2024 • edited Loading

jagill commented Dec 5, 2024 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

API

Errors

Performance

alamb Dec 17, 2024

Choose a reason for hiding this comment

alamb Dec 17, 2024

Choose a reason for hiding this comment

jagill Dec 18, 2024

Choose a reason for hiding this comment

jagill commented Dec 18, 2024

alamb commented Dec 24, 2024

jagill commented Dec 30, 2024 • edited Loading

alamb commented Dec 30, 2024

jagill commented Jan 9, 2025

Jefffrey left a comment

Choose a reason for hiding this comment

Jefffrey Jan 19, 2025

Choose a reason for hiding this comment

jagill Jan 20, 2025

Choose a reason for hiding this comment

Jefffrey left a comment

Choose a reason for hiding this comment

Jefffrey commented Jan 26, 2025

Jefffrey commented Jan 26, 2025

tustvold commented Nov 8, 2024 •

edited

Loading

jagill commented Dec 5, 2024 •

edited

Loading

jagill commented Dec 30, 2024 •

edited

Loading