POC: implement iter_annexworktree with one run-context manager and two batch-context managers #520

Conversation
```python
# The following protocol is part of the POC implementation. It allows a simple
# `with run()`-commandline that returns `GitWorkTreeItem`-objects or
# `GitWorkTreeFileSystemItem`-objects. In the real implementation we would
# probably use `iter_gitworktree()`.
```
This raises interesting points for me:

- yes, we would definitely use `iter_gitworktree()`, but `iter_gitworktree()` could be implemented this way for max alignment
- `GitLsFilesProtocol` should probably not sit on `StdOutCaptureGeneratorProtocol` directly, but on an intermediate protocol that implements the zero-byte-delimited-items output, or even delimited-items output (a sketch of such an intermediate protocol follows this list). Right now the protocol assumes that `ls-files` is called correctly, but in the line `run(['git', 'ls-files', '-z', '--stage', '--cached'], protocol_class=GitLsFilesProtocol...` I cannot immediately see this requirement/implication.
- it seems we should/could aim for assembling a library of such common protocols
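A minimal sketch of what such an intermediate protocol could look like. It assumes the runner's convention that stdout chunks arrive via `pipe_data_received()`, and that generator protocols expose a result-forwarding hook (called `send_result()` here, which is an assumed name, not a confirmed API):

```python
class ZeroByteItemsProtocol(StdOutCaptureGeneratorProtocol):
    """Split captured stdout into zero-byte-delimited items (sketch)."""

    _pending = b''

    def pipe_data_received(self, fd, data):
        # everything before the last NUL is a run of complete items;
        # the remainder is kept until the next chunk arrives
        self._pending += data
        *items, self._pending = self._pending.split(b'\x00')
        for item in items:
            self.send_result(item)  # assumed result-forwarding hook
```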
> This raises interesting points for me:
>
> - yes, we would definitely use `iter_gitworktree()`, but `iter_gitworktree()` could be implemented this way for max alignment

That's what I thought as well. I was tempted to do that. But I refrained, in order to focus on the annex-batch-related parts.

> - `GitLsFilesProtocol` should probably not sit on `StdOutCaptureGeneratorProtocol` directly, but on an intermediate protocol that implements the zero-byte-delimited-items output, or even delimited-items output. Right now the protocol assumes that `ls-files` is called correctly, but in the line `run(['git', 'ls-files', '-z', '--stage', '--cached'], protocol_class=GitLsFilesProtocol...` I cannot immediately see this requirement/implication.
>
> - it seems we should/could aim for assembling a library of such common protocols

The now-closed big-runner PR with data processors has a protocol, `StdOutErrCaptureProcessingGeneratorProtocol`, that passes all data through a data-processor pipeline. The pipeline is configured via the protocol keyword argument `processors`.
With this concept, the `iter_gitworktree` call would look something like this:

```python
with run(
    ['git', 'ls-files', '-z', '--stage', '--cached'],
    protocol_class=StdOutCaptureProcessingGeneratorProtocol,
    protocol_kwargs={
        'processors': [
            decode_processor('utf-8'),
            splitlines_processor('\x00', False),
            gititem_processor,
        ]
    },
) as git_ls_files:
    pass
```
The `gititem_processor` does not yet exist, but it just converts a line of `git ls-files` output, and probably some pending data, into a `GitWorkTreeItem`.
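For illustration only, a minimal sketch of what such a processor could do. It assumes the pipeline hands each processor a list of already-decoded, NUL-split lines and expects a `(results, remainder)` tuple back; this signature is an assumption about the pipeline, and plain tuples stand in for `GitWorkTreeItem`:

```python
def gititem_processor(lines):
    """Convert `git ls-files --stage -z` items into worktree-item stubs.

    Each item has the form '<mode> <object-id> <stage>\t<path>'.
    """
    results = []
    for line in lines:
        props, path = line.split('\t', maxsplit=1)
        mode, gitsha, _stage = props.split(' ')
        # a real implementation would build a `GitWorkTreeItem` here
        results.append((path, mode, gitsha))
    # this simplified sketch never keeps pending data
    return results, []
```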
The run-call could of course be encapsulated into a specialized function, reducing the call to something like this:

```python
with git_ls_files(path=path, ...) as gls:
    ...
```
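A hypothetical sketch of such a wrapper, reusing the (assumed) protocol and processor names from above:

```python
from contextlib import contextmanager

@contextmanager
def git_ls_files(path='.'):
    """Yield a generator of worktree items for the repository at `path`."""
    with run(
        # `-C` makes git operate on the repository at `path`
        ['git', '-C', str(path), 'ls-files', '-z', '--stage', '--cached'],
        protocol_class=StdOutCaptureProcessingGeneratorProtocol,
        protocol_kwargs={
            'processors': [
                decode_processor('utf-8'),
                splitlines_processor('\x00', False),
                gititem_processor,
            ]
        },
    ) as item_generator:
        yield item_generator
```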
I left a first comment. I also tried to run it on my standard lots-of-files dataset. But it seems to hang there:

```python
from datalad_next.iter_collections.annexworktree import iter_annexworktree
res = list(iter_annexworktree('.', untracked='all', link_target=False))
# <Ctrl-C> after some time
```
```
File ~/env/datalad-dev/lib/python3.11/site-packages/datalad/runner/nonasyncrunner.py:637, in ThreadedRunner.process_queue(self)
    627 while True:
    628     # We do not need a user provided timeout here. If
    629     # self.timeout is None, no timeouts are reported anyway.
   (...)
    634     # `ThreadedRunner.process_check_interval`, to check whether the
    635     # process is still running.
    636     try:
--> 637         file_number, state, data = self.output_queue.get(
    638             timeout=ThreadedRunner.timeout_resolution)
    639         break
    640     except Empty:

File /usr/lib/python3.11/queue.py:180, in Queue.get(self, block, timeout)
    178 if remaining <= 0.0:
    179     raise Empty
--> 180 self.not_empty.wait(remaining)
    181 item = self._get()
    182 self.not_full.notify()

File /usr/lib/python3.11/threading.py:324, in Condition.wait(self, timeout)
    322 else:
    323     if timeout > 0:
--> 324         gotit = waiter.acquire(True, timeout)
    325     else:
    326         gotit = waiter.acquire(False)

KeyboardInterrupt:
```
Thanks for the test. Weird, it seems to not be able to get the lock for the queue...

Edit: I think the problem might have been a bug in the

You should nevertheless have gotten a timeout after ca. 4 seconds!? Weird! Which dataset do you use?

EDIT: maybe it just takes long? It took 3:10 minutes on my machine to iterate over ca. 33,000 files.
I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine. If you interrupt the execution with `Ctrl-C`, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.
I did not wait for 3 min, indeed. As you can see in dafafb3, I would expect a runtime of substantially below 10 s.
It takes between 2 and 3 seconds to do the git-ls-files iteration only (comment out batch processing). I did the following test:

And this one:

So, looking up a large number of keys is too slow. I will create a version that uses a run-context for
I pushed 2913989. It uses two run-contexts to read from

This version tries to keep the maximum number of stored results low while having a single pass over both run-context iterators (the actual number depends on the ordering of names that are returned from the two run-contexts). The overall time is around 14 seconds. That is more than 10 seconds, but a lot less than 3:15 minutes. The code is not as simple anymore though :-/

It should be possible to do the same thing with
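Not the PR's actual code, but a minimal sketch of the buffering idea: a single pass over two `(name, info)` iterators that only stores items whose counterpart has not been seen yet:

```python
from itertools import zip_longest

def merge_by_name(git_items, annex_items):
    """Join two streams of `(name, info)` pairs on `name` in one pass."""
    pending_git = {}
    pending_annex = {}
    for git_item, annex_item in zip_longest(git_items, annex_items):
        if git_item is not None:
            name, info = git_item
            if name in pending_annex:
                yield name, info, pending_annex.pop(name)
            else:
                pending_git[name] = info
        if annex_item is not None:
            name, info = annex_item
            if name in pending_git:
                yield name, pending_git.pop(name), info
            else:
                pending_annex[name] = info
    # items without a counterpart (e.g. non-annexed files) remain in
    # `pending_git`; a real implementation would emit those as well
```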
Co-authored-by: Michael Hanke <[email protected]>

This commit adds a `batchcommand` context manager that keeps track of a subprocess and supports batch-style communication. That means it allows one to send a request to the subprocess and will return the response from the subprocess. How the output of the subprocess has to be interpreted to yield a response is defined by the protocol that is provided to the batchcommand context manager. The batchcommand context manager uses the run context manager to ensure that the exit-handler waits for process termination.

The commit also contains three convenience functions that provide context managers for common interaction patterns (common in `datalad` and `datalad_next` anyway):

1. `stdout_batchcommand`: this context manager writes to `stdin` of the subprocess and returns the next, yet unread, chunk of bytes that was written to `stdout` of the subprocess.
2. `stdoutline_batchcommand`: this context manager writes to `stdin` of the subprocess and returns the next, yet unread, line that was written to `stdout` of the subprocess. Line separators, and whether they are included in the output, can be chosen when creating the context manager.
3. `annexjson_batchcommand`: this context manager writes to `stdin` of the subprocess and returns a JSON object that is created from the next, yet unread, line that is written to `stdout` of the subprocess. This context manager is geared towards git-annex batch commands with JSON result output, i.e. git-annex invocations with the parameters `--batch`, `--json`, `--json-error-messages`.

The tests provide an example of a protocol that handles output that varies in size, i.e. output of Python programs. The class that implements the protocol is called `PythonProtocol`.
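For illustration, assumed usage of the JSON variant; the exact call convention of the yielded batch object is an assumption based on the description above, while `git annex find --batch --json --json-error-messages` is a real batch-capable git-annex invocation:

```python
# hypothetical session: one request line in, one decoded JSON object out
with annexjson_batchcommand(
    ['git', 'annex', 'find', '--batch', '--json', '--json-error-messages'],
) as annex_find:
    record = annex_find('myfile.dat\n')  # path is illustrative
    # a git-annex JSON record carries e.g. the annex key of the file
    print(record['key'])
```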
Draft of `iter_annexworktree()`

Performance observations (annexrepo with 36k annexed files):

```
from datalad_next.iter_collections.annexworktree import iter_annexworktree

%timeit res = list(iter_annexworktree('.', untracked='all', link_target=False))
4.22 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res = list(iter_annexworktree('.', untracked=None, link_target=False))
4.33 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res = list(iter_annexworktree('.', untracked=None, link_target=True))
10.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes datalad#459
This commit adds tests that document the difference in runtime between processing n lines interleaved, i.e. write one line and read one line, and processing n lines in bulk, i.e. write n lines and read n lines.
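Not the commit's actual tests, but a self-contained sketch of the two patterns being compared, using `cat` as a stand-in echo subprocess:

```python
import subprocess
import time

# kept small enough that the bulk variant's writes fit into the OS pipe
# buffers and cannot deadlock against the still-unread responses
LINES = [b'%d\n' % i for i in range(5000)]

def interleaved():
    p = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for line in LINES:
        p.stdin.write(line)
        p.stdin.flush()
        p.stdout.readline()   # wait for each response before sending more
    p.stdin.close()
    p.wait()

def bulk():
    p = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for line in LINES:
        p.stdin.write(line)   # send all requests first ...
    p.stdin.close()
    for _ in LINES:
        p.stdout.readline()   # ... then read all responses
    p.wait()

for variant in (interleaved, bulk):
    start = time.time()
    variant()
    print(f'{variant.__name__}: {time.time() - start:.3f}s')
```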
Codecov Report

```
@@                  Coverage Diff                       @@
##        minimal-run-batch-context     #520      +/-   ##
==========================================================
+ Coverage                    19.73%   92.41%   +72.67%
==========================================================
  Files                          129      134        +5
  Lines                         9175     9764      +589
  Branches                      1054     1084       +30
==========================================================
+ Hits                          1811     9023     +7212
+ Misses                        7326      722     -6604
+ Partials                        38       19       -19
```

☔ View full report in Codecov by Sentry.
With #538 merged, we are betting on a new horse for subprocess execution. I believe this PR can be closed and/or reposted in rejuvenated form.
This PR is not meant for merging. It is a POC to demonstrate how `iter_annexworktree` can be implemented with run-context-managers and batch-context-managers in the following fashion:

It also introduces a wrapper around the result iterators that will trigger a termination if an iteration step takes longer than a given timeout.
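The wrapper itself is not shown here; a minimal sketch of the idea, assuming a hypothetical `on_timeout` callback that would terminate the underlying subprocess:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_SENTINEL = object()

def guarded_iter(iterator, timeout, on_timeout):
    """Yield from `iterator`, invoking `on_timeout()` if a single
    iteration step takes longer than `timeout` seconds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            step = pool.submit(next, iterator, _SENTINEL)
            try:
                item = step.result(timeout=timeout)
            except TimeoutError:
                # e.g. kill the subprocess; this also unblocks the worker
                # thread so the executor can shut down cleanly
                on_timeout()
                raise
            if item is _SENTINEL:
                return
            yield item
```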
Note: the provided git-ls-files implementation is not as complete as the "original", since this is a POC. It might not work in some corner cases. So for large-scale performance comparisons, a little more work will most likely be needed.