-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transform the parser to CPS. Support pull parsing #51
base: master
Are you sure you want to change the base?
Conversation
The goal was to be able to do: > {continue, Continue, Events1} = erlsom:parse_pull(InitialData, ParserOpts), %% Events1 being a list of the events produced by erlsom so far. >{continue, Continue2, Events2} = Continue(MoreData), ... {ok, EventsN, RemainingData} = ContinueN(FinalData). That is, invert the control with the caller passing the data in chunks to the parser, instead of the parser asking for more in a callback. This allow simpler integration for long running parsing (think in infinite xml streams comming from network), and also allows hibernating the parsing process, that on current version is not always possible (as hibernate discard the stack). Returning the list of events instead of using sax callbacks is mostly a matter of preference, sax callbacks can still be used. A found returning a list cleaner to work with (and size of the list is bounded if you feed the parser in chunks) The change was to make every possible "blocking" parsing operation become CPS. Then the continueFun/* in erlsom_sax_lib check if we are using continuation callback or not. If not using continuation callback, it just return a function from where caller can continue parsing: continueFun2K(T, V1, V2, V3, State = #erlsom_sax_state{continuation_fun=undefined}, ParseFun, K) -> {continue, fun(Data) -> ParseFun(<<T/binary, Data/binary>>, V1, V2, V3, State#erlsom_sax_state{user_state = []}, K) end, lists:reverse(State#erlsom_sax_state.user_state)}; Initial test show little performance degradation, and any work done by the user with the parser events would make the difference in parsing costs irrelevant, but need more complete tests to assert that.
Hello Pablo, Very interesting. At first sight I would say that it might be even nicer if you could really I don't understand your point about hibernation, can you give an example? Regards, On Mon, Jan 4, 2016 at 5:33 PM, Pablo Polvorin [email protected]
|
Hello, thanks for your time!. Regarding hibernation, see the code at the end. -module(phibernation).
-compile(export_all).
send(Pid, []) ->
Pid ! done;
send(Pid, [D|Rest]) ->
Pid ! {data, D},
receive after 100 -> ok end, %%just to simulate a little delay. On cases this is minutes between each chunk comming from client side
io:format("\nParser process memory usage: ~p\n", [erlang:process_info(Pid, memory)]),
send(Pid, Rest).
test() ->
io:format("\n*** Normal run\n"),
test(fun() -> loop(false) end),
io:format("\n*** Hibernate run\n"),
test(fun() -> loop(true) end),
io:format("\n*** Sax with continuation_fun\n"),
test(fun() -> loop_with_callback(false) end),
io:format("\n*** Sax with continuation_fun, hibernate (fail)\n"),
test(fun() -> loop_with_callback(true) end),
ok.
test(LoopFun) ->
Data = [<<"<s:xstream xmlns:s='http://some-namespace.com' as='234' version='1.0'>">>,
<<"<a href='ab">>,
<<"c'><b>ss</b></a>">>,
<<"<a href='ab">>,
<<"c'><b>ss</b></a>">>],
Pid = spawn(LoopFun),
send(Pid, Data).
loop(ShouldHibernate) ->
loop(undefined, ShouldHibernate).
loop(Parser, ShouldHibernate) ->
receive
{data, Data} ->
{continue, Parser2, Events} = case Parser of
undefined -> erlsom:parse_pull(Data, [{output_encoding, utf8}]);
_ -> Parser(Data)
end,
io:format("Parsed ~p Events: ~p", [Data, Events]),
case ShouldHibernate of
%% Hibernate the process to reduce memory usage, since we expect that further data won't come in for a while
true -> erlang:hibernate(?MODULE, loop, [Parser2, ShouldHibernate]);
false -> loop(Parser2, ShouldHibernate)
end;
done ->
ok
end.
loop_with_callback(ShouldHibernate) ->
SaxCallbackState = [],
F = fun(Ev, _) -> io:format("~p" ,[Ev]) end,
Continuation = case ShouldHibernate of
true -> fun continuation_hibernated/2;
false -> fun continuation/2
end,
erlsom:parse_sax(<<>>, SaxCallbackState, F, [{continuation_function, Continuation, []}, {output_encoding, utf8}]).
continuation(Tail, State) ->
receive
{data, Data} ->
{<<Tail/binary, Data/binary>>, State}
end.
continuation_hibernated(Tail, State) ->
%% This won't work. erlang:hibernate discard the call stack, (no erlsom:parse_sax/4 is left) so when this process wakeups
%% continuation/2 will be called and the process ends.
erlang:hibernate(?MODULE, continuation, [Tail, State]). Results (events prints removed for clarity) :
One instead of hibernate could force GC, but isn't the same |
Hi Pablo, Just a short note to tell you that I did not forget about your idea. I agree that support for hibernation would be nice, and the idea of a continuation is also nice. But I still don't like the idea of getting a list of events, for me that goes against the idea of erlsom, which (in my mind) is to look at sax events one by one. So I would like to think about this a bit more when I have some more time. (To get a bit more feeling for the idea I wrote the small module below: this is a simple way to use the existing parser as the basis for a pull parser) -module(pull_parser).
-export([start/1, pull/1]).
%% export for use by this module only
-export([parse/1]).
%% starts a new process that parses up to the next event and then
%% waits until it receives a pull-request
start(Xml) ->
spawn(?MODULE, parse, [Xml]).
pull(Parser) ->
Parser ! {pull, self()},
receive
Event ->
Event
end.
%% internal functions
parse(Xml) ->
erlsom:parse_sax(Xml, [], fun callback/2).
callback(endDocument, State) ->
State;
callback(Event, State) ->
receive
{pull, Client} ->
Client ! Event,
State
end. Regards, |
I wonder if you will be willing to merge something in the lines of this patch.
(In such case this will need more tests and performance comparison)
The basic idea is:
For pull parsing, I mean:
That is, allow the user of the parser to pass the data in chunks to the parser, instead of the parser asking for more in a callback.
This allow simpler integration for long running parsing (think in infinite xml streams coming from network), and also allows hibernating the parsing process, that on current version is not always possible (as hibernate discard the stack).
Returning the list of events instead of using sax callbacks is mostly a matter of preference, sax callbacks can still be used. I found returning a list easier to work with (and size of the list is bounded
if you feed the parser in chunks).
The change was to make every possible "blocking" parsing operation be in continuation passing style. Then the continueFun/* in erlsom_sax_lib check if we are in a pull parsing or not. If we are a pull parser, instead of asking the user for more data, it just return a function from where caller can continue parsing:
(when pull parsing, the sax callbacks just accumulate the events in the user_state, so here we return them on the right order)
For transforming to CPS, the code is now allocating more funs while parsing. Initial test show little performance degradation, and any work done by the user with the parser events would make the difference in parsing costs irrelevant, but need more complete tests to assert that.
I only tested the utf8 backend, others should work the same.
What's your opinion?