-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
streaming ndjson input and streaming flattened ndjson output (#53) #69
base: dev
Are you sure you want to change the base?
Conversation
The user query is compiled and the resulting body is placed within a function definition. The last statement, if an expression, is wrapped in a Return statement. The resulting function is compiled/exec'ed to create a function object. This function is then called. The user query may now contain return, yield, and yield from statements. If the user query contains yield or yield from statements the resulting function is a generator. If the function returns a iterator or is a generator the results will be placed in a list before serializing to json.
option -F will flatten the output, formatting and printing each item of the query's list, iterator, or generator output as a distinct line of newline-delimited json. Writing data is streamed so that the output data set is not held in memory in its entirety. An error occurs if the returned output is not a list or iterator, or if the user's query does not yield as a generator.
Option -S reads data from stdin/files using existing mechanisms, as newline-delimited json. "_" is an iterator providing the json values. Note that json parsing errors will not raise until running the user query, or until formatting and printing output in the case that output is streamed and flattened using -F. Due to refactoring the order of loading data and performing certain checks has changed. As a result users may see different error messages than in the past.
First of all, thank you very much for tackling this! I think this takes One thing I'm noticing when playing around with this is that the I'll investigate more, but other than the streaming aspect, do you think there is that big of a difference? If not, maybe we just enhance
Basically just make |
elif isinstance(response, list): | ||
it = iter(response) | ||
else: | ||
raise TypeError('-F/flatten requires the query to return an iterator/generator or list') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just output the object or scalar here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no technical reason this can't be done. In my opinion it suggests an error: the user requested flattening the output and then returned a value that can't be flattened.
There are some surprising results.
"jello -F" whose query returns 1
would output
1
"jello -F" whose query returns [1]
would flatten and similarly output
1
In my mind this is a question of what is most intuitive and least likely to confuse or produce an unexpected result. For example, I considered letting the user return an iterable in addition to an iterator (so anything providing iter() to produce an iterator) but this would lead to confusing situations such as returning the string, whose iterator provides each distinct character (returning "asdf" provides lines with each of "a", "s", "d", and "f").
Similarly, I want to avoid the user trying to stream but accidentally materializing the entire result in memory (they use -S/-F and then start swapping or oom). To that end maybe allowing the query to return a list is a bad idea and it should always be an iterator (including a generator).
So it's really about what we want to prioritize: try to allow the greatest breadth of return types to just work, or making things as explicit as possible and avoiding expected pitfalls at the expense of a bit more apparent complexity and perhaps being a little less intuitive.
This is also where doing something like #67 splits the difference -- it changes the semantics so that the query is called multiple times, always receiving and returning scalars. But that shift away from a single call with the entire input being in "_" and the returned value containing the entire output will also confuse.
I'm going back and forth while writing this (which I think you are too). My opinion at this point is that if it's streaming then the output must be an iterator, including a generator. Remove even a list. It'll be tedious for cases where the user wants to return a single value but can be made explicit with a good error message including the recommendation to yield this one value.
But I'm not convinced. I just want to weight the implication before introducing many options with complex semantics.
Been ruminating on this a bit. Actually thinking about simplifying even further so that |
Having a single streaming mode makes sense to me. Honestly splitting it was convenient while writing the code and to organize this PR, but I agree having fewer options is probably better. It someone wants to separate input and output so that their entries aren't 1:1 it's still not that hard -- just yield once per output and if you want to return a scalar yield that one scalar. Other considerations:
|
#53
Please read the commits separately first to associate the parts that were changed with the functionality they provide.
Note that I did refactor a bit. I tried to leave things alone where possible and clean up where I had to change things. Please do let me know if you want to change anything (e.g., I added some custom Exceptions to be able to catch them specifically but this differs from other error and exception handling; I added a couple classes in lib which is a bit of a deviation from surrounding style).
Changes in behavior:
Other notes:
this is incompatible with -R
For streaming input with -S, -R is ignored (but could be changed to stream raw lines of text if desired)
For streaming output with -F, -r appears to work. In general I used all existing formatting code. There are no tests for this combination.