streaming ndjson input and streaming flattened ndjson output (#53) #69

Open
wants to merge 3 commits into
base: dev

stevenpelley

#53

Please read the commits separately first to associate the parts that were changed with the functionality they provide.

Note that I did refactor a bit. I tried to leave things alone where possible and clean up where I had to change things. Please do let me know if you want to change anything (e.g., I added some custom Exceptions to be able to catch them specifically but this differs from other error and exception handling; I added a couple classes in lib which is a bit of a deviation from surrounding style).

Changes in behavior:

  • user query allows return, yield, and yield from statements. The last statement is returned if it is an expression.
  • if the user query returns an iterator or is a generator (it yields) this returned result will be wrapped in a list and serialized as json.
  • -F to flatten and stream output as ndjson. An error occurs if the user query output is not a list or iterator (including generator). If it is an iterator items are taken and immediately formatted, making memory usage independent of the size of the output.
  • -S to stream ndjson input. An error occurs if the user query input is not ndjson. "_" of the user query is an iterator providing the json items. The input is read and parsed only as the user query reads from the iterator, making memory usage independent of the size of the input.
  • There may be unintended changes to errors (although tests pass). When using -S, JSON loading errors occur late in execution because loading/parsing is deferred, so data may be output before a JSON loading error occurs.

Other notes:

  • This is incompatible with -R.
  • For streaming input with -S, -R is ignored (but this could be changed to stream raw lines of text if desired).
  • For streaming output with -F, -r appears to work. In general I used all existing formatting code. There are no tests for this combination.

The user query is compiled and the resulting body is placed within
a function definition.  The last statement, if an expression, is wrapped
in a Return statement.  The resulting function is compiled/exec'ed to
create a function object.  This function is then called.

The user query may now contain return, yield, and yield from statements.
If the user query contains yield or yield from statements the resulting
function is a generator.
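
The wrapping described above could be sketched roughly as follows. This is a minimal illustration using the standard `ast` module, not the code in this PR; the helper name `build_query_function` and the default function name `_user_query` are hypothetical.

```python
import ast


def build_query_function(query, name="_user_query"):
    """Compile a query string into a function of `_` (hypothetical helper)."""
    body = ast.parse(query).body
    if not body:
        raise ValueError("empty query")
    # If the last statement is a bare expression, return its value.
    if isinstance(body[-1], ast.Expr):
        body[-1] = ast.Return(value=body[-1].value)
    # Parse an empty function as a template, then swap in the user's statements.
    module = ast.parse(f"def {name}(_):\n    pass")
    module.body[0].body = body
    ast.fix_missing_locations(module)
    namespace = {}
    exec(compile(module, "<query>", "exec"), namespace)
    return namespace[name]


double_plus_one = build_query_function("x = _ * 2\nx + 1")
print(double_plus_one(3))  # 7

doubled = build_query_function("for i in _:\n    yield i * 2")
print(list(doubled([1, 2])))  # [2, 4]
```

Because the statements end up inside a `def`, a `yield` anywhere in the query makes the resulting function a generator function without any extra handling.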

If the function returns an iterator or is a generator the results will be
placed in a list before serializing to json.

Option -F will flatten the output, formatting and printing each item of the
query's list, iterator, or generator output as a distinct line of
newline-delimited json.  Writing data is streamed so that the output data
set is not held in memory in its entirety.
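
A rough sketch of the flattening loop, assuming plain `json.dumps` in place of the existing formatting code; the name `write_flattened` is hypothetical and the type check only approximates the behavior described here.

```python
import json
import sys
from collections.abc import Iterator


def write_flattened(response, out=sys.stdout):
    """Write each item of `response` as one line of ndjson (hypothetical helper)."""
    if isinstance(response, Iterator):
        it = response
    elif isinstance(response, list):
        it = iter(response)
    else:
        raise TypeError(
            "-F/flatten requires the query to return an iterator/generator or list"
        )
    # Items are serialized and written one at a time, so a generator's
    # output is never collected into a single in-memory list.
    for item in it:
        out.write(json.dumps(item) + "\n")


write_flattened(n * n for n in range(3))
```

With a generator as input, peak memory depends on the largest single item rather than on the number of items.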

An error occurs if the returned output is not a list or iterator, or if the
user's query does not yield as a generator.

Option -S reads data from stdin/files using existing mechanisms, as
newline-delimited json.  "_" is an iterator providing the json values.

Note that json parsing errors will not raise until running the user query, or
until formatting and printing output in the case that output is streamed and
flattened using -F.
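
To illustrate why parsing errors surface late, here is a minimal sketch of a lazy ndjson reader. The name `iter_ndjson` is hypothetical and an in-memory `io.StringIO` stands in for stdin/files; the PR's actual reader may differ.

```python
import io
import json


def iter_ndjson(lines):
    """Lazily parse one JSON value per non-blank line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            # Parsing happens only when the consumer asks for the next item.
            yield json.loads(line)


source = io.StringIO('{"a": 1}\n{"a": 2}\nnot json\n')
items = iter_ndjson(source)
print(next(items))  # {'a': 1}
print(next(items))  # {'a': 2}
try:
    next(items)
except json.JSONDecodeError as exc:
    # The bad third line is only discovered after two items were consumed.
    print("late error:", exc)
```

The first two items would already have been printed (or written by -F) before the malformed line raises.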

Due to refactoring the order of loading data and performing certain
checks has changed.  As a result users may see different error messages
than in the past.
@stevenpelley stevenpelley mentioned this pull request Dec 5, 2024
@kellyjonbrazil
Owner

kellyjonbrazil commented Dec 7, 2024

First of all, thank you very much for tackling this! I think this takes jello to the next level.

One thing I'm noticing when playing around with this is that the -F option seems very similar to -l, except -l is a little more like -Fc.

I'll investigate more, but other than the streaming aspect, do you think there is that big of a difference? If not, maybe we just enhance -l to be the new -F so we don't need to add another option. If that makes sense, then maybe we should make it act like -l so it puts every object/scalar on its own line.

% echo '[{"a":1},{"a":"\\n"},1,2,3,4,"\\n",5]' | jello -Fc
{"a":1}
{"a":"\n"}
1
2
3
4
"\n"
5

% echo '[{"a":1},{"a":"\\n"},1,2,3,4,"\\n",5]' | jello -l 
{"a":1}
{"a":"\n"}
1
2
3
4
"\n"
5

Basically just make -l streaming.

elif isinstance(response, list):
    it = iter(response)
else:
    raise TypeError('-F/flatten requires the query to return an iterator/generator or list')
Owner

Could we just output the object or scalar here instead?

Author

There's no technical reason this can't be done. In my opinion it suggests an error: the user requested flattening the output and then returned a value that can't be flattened.

There are some surprising results.
"jello -F" whose query returns 1 would output

1

"jello -F" whose query returns [1] would flatten and similarly output

1

In my mind this is a question of what is most intuitive and least likely to confuse or produce an unexpected result. For example, I considered letting the user return an iterable in addition to an iterator (so anything providing __iter__() to produce an iterator) but this would lead to confusing situations such as returning a string, whose iterator provides each distinct character (returning "asdf" provides lines with each of "a", "s", "d", and "f").
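
A small illustration of the iterable/iterator distinction, using `collections.abc` from the standard library. The helper `is_streamable` is hypothetical and only shows what an iterator-only check would accept.

```python
from collections.abc import Iterable, Iterator


def is_streamable(value):
    """True only for iterators (including generators); hypothetical check."""
    return isinstance(value, Iterator)


# A string is iterable, and iterating it yields single characters.
print(list(iter("asdf")))                                   # ['a', 's', 'd', 'f']
print(isinstance("asdf", Iterable), is_streamable("asdf"))  # True False

# A list is iterable but not an iterator; iter() and generators are iterators.
print(is_streamable([1]), is_streamable(iter([1])))         # False True
print(is_streamable(x for x in range(3)))                   # True
```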

Similarly, I want to avoid the user trying to stream but accidentally materializing the entire result in memory (they use -S/-F and then start swapping or oom). To that end maybe allowing the query to return a list is a bad idea and it should always be an iterator (including a generator).

So it's really about what we want to prioritize: try to allow the greatest breadth of return types to just work, or making things as explicit as possible and avoiding expected pitfalls at the expense of a bit more apparent complexity and perhaps being a little less intuitive.

This is also where doing something like #67 splits the difference -- it changes the semantics so that the query is called multiple times, always receiving and returning scalars. But that shift away from a single call with the entire input being in "_" and the returned value containing the entire output will also confuse.

I'm going back and forth while writing this (which I think you are too). My opinion at this point is that if it's streaming then the output must be an iterator, including a generator. Remove even a list. It'll be tedious for cases where the user wants to return a single value but can be made explicit with a good error message including the recommendation to yield this one value.

But I'm not convinced. I just want to weigh the implications before introducing many options with complex semantics.

@kellyjonbrazil
Owner

Been ruminating on this a bit. Actually thinking about simplifying even further so that -S puts jello into "streaming mode", which means both input and output will stream at the same time. With this code it basically means -S behaves as -SFc. I would change the -F behavior so it could also output scalars and non-iterator objects. Still thinking on this, but seems like it might make sense.

@stevenpelley
Author

> Been ruminating on this a bit. Actually thinking about simplifying even further so that -S puts jello into "streaming mode", which means both input and output will stream at the same time. With this code it basically means -S behaves as -SFc. I would change the -F behavior so it could also output scalars and non-iterator objects. Still thinking on this, but seems like it might make sense.

Having a single streaming mode makes sense to me. Honestly splitting it was convenient while writing the code and to organize this PR, but I agree having fewer options is probably better. If someone wants to separate input and output so that their entries aren't 1:1 it's still not that hard -- just yield once per output and if you want to return a scalar yield that one scalar.
As per my reply in a previous comment I'd argue for requiring the query returned value to be an iterator/generator (no list, no scalar). I'm sure this will confuse someone but also gives an opportunity via error messages to guide the user (e.g., if you stream and return a list you're materializing the results and this probably isn't what you want to do).
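
To make the error-message idea concrete, a strict check might look something like this. It is a sketch only: `require_iterator` and the message wording are hypothetical, not part of this PR.

```python
from collections.abc import Iterator


def require_iterator(response):
    """Return `response` if it is an iterator, else raise with a hint (hypothetical)."""
    if isinstance(response, Iterator):
        return response
    if isinstance(response, list):
        hint = "returning a list materializes the whole result; yield each item instead"
    else:
        hint = "to emit a single value, yield it"
    raise TypeError(
        f"streaming requires the query to return an iterator or generator; {hint}"
    )


def single_value_query(_):
    # A query that wants to emit one scalar simply yields it.
    yield 42


print(list(require_iterator(single_value_query(None))))  # [42]
```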

Other considerations:

  • I like requiring "-c" and not making it implicit in a streaming option for a couple reasons. This keeps it consistent with other options and with jq (not strictly necessary but it's nice to keep things consistent and we might assume, possibly incorrectly, that they had good reasons for organizing their options the way they did). There are cases where I do want to pretty print (that is, not use "-c") even when streaming: passing the filter output through "less" to find a known document and then examine it; storing stream output to a file to diff multiple streams -- diffing is much easier to see on pretty-printed json.
  • There are some advanced features that interact oddly with streaming. The most significant is the availability of "_" in .jelloconf.py when streaming -- "_" is an iterator, so if running .jelloconf.py advances the iterator these values are no longer available while running the user query. I have no idea how many people use this ability and at that point people are on their own, but it might be worth not touching "-l" as changing it would break existing calls. Up to you how strictly we observe semver and maintain backwards compatibility.
