
POC: implement iter_annexworktree with one run-context manager and two batch-context managers #520

Conversation

christian-monch
Contributor

@christian-monch christian-monch commented Oct 27, 2023

This PR is not meant for merging. It is a POC to demonstrate how iter_annexworktree can be implemented with run-context-managers and batch-context-managers in the following fashion:

```
with \
        run(['git', 'ls-files', ...], ...) as git_ls_files, \
        annexjson_batchcommand(['git', 'annex', 'examinekey', ...], ...) as examine_key, \
        annexline_batchcommand(['git', 'annex', 'lookupkey', ...], ...) as lookup_key:
    for item in git_ls_files:
        annex_key = lookup_key(item.name + '\n')
        if annex_key:
            key_properties = examine_key(annex_key)
            yield AnnexWorkTreeItem(item, key_properties)
        else:
            yield AnnexWorkTreeItem(item, None)
```

It also introduces a wrapper around the result iterators that triggers termination if an iteration step takes longer than a given timeout.
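
Such a timeout guard can be sketched with a producer thread feeding a queue, so the consumer can bound how long a single iteration step may take. The name `timeout_iter` and the `on_timeout` hook are hypothetical illustrations, not the PR's actual API:

```python
import threading
from queue import Queue, Empty
from typing import Callable, Iterable, Iterator, Optional


def timeout_iter(source: Iterable,
                 timeout: float,
                 on_timeout: Optional[Callable[[], None]] = None) -> Iterator:
    """Yield items from `source`, raising TimeoutError if a single
    iteration step takes longer than `timeout` seconds.

    `on_timeout` is an optional callable (e.g. a process terminator)
    that is invoked before the error is raised. Illustrative sketch,
    not the wrapper implemented in this PR.
    """
    queue: Queue = Queue()
    _END = object()

    def producer():
        try:
            for item in source:
                queue.put(('item', item))
        except Exception as e:
            queue.put(('error', e))
        queue.put(('item', _END))

    threading.Thread(target=producer, daemon=True).start()
    while True:
        try:
            kind, payload = queue.get(timeout=timeout)
        except Empty:
            if on_timeout:
                on_timeout()
            raise TimeoutError(f'iteration step exceeded {timeout}s')
        if kind == 'error':
            raise payload
        if payload is _END:
            return
        yield payload
```

A stalled subprocess iterator then surfaces as a `TimeoutError` in the consuming loop instead of hanging indefinitely.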

Note: the provided git-ls-files implementation is not as complete as the "original", since this is a POC. It might not work in some corner cases. So, for large-scale performance comparisons, a little more work most likely needs to be done.

@christian-monch christian-monch changed the title POC: implement iter annexworktree with one run-context manager and two batch-context managers POC: implement iter_annexworktree with one run-context manager and two batch-context managers Oct 27, 2023
Comment on lines 35 to 38
# The following protocol is part of the POC implementation. It allows a simple
# `with run()`-commandline that returns `GitWorkTreeItem`-objects or
# `GitWorkTreeFileSystemItem`-objects. In the real implementation we would
# probably use `iter_gitworktree()`.
Member


This raises interesting points for me:

  • yes, we would definitely use iter_gitworktree(), but iter_gitworktree() could be implemented this way for max alignment

  • GitLsFilesProtocol should probably not sit on StdOutCaptureGeneratorProtocol directly, but on an intermediate protocol that implements the zero-byte-delimited-items output, or even delimited-items output. Right now the protocol assumes that ls-files is called correctly, but in the line

      run(['git', 'ls-files', '-z', '--stage', '--cached'], protocol_class=GitLsFilesProtocol...
    

    I cannot immediately see this requirement/implication.

  • it seems we should/could aim for assembling a library of such common protocols
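
The "zero-byte-delimited items" layer proposed above could, for instance, be built on a small buffering helper along these lines. The class name and API are made up for illustration; this is not datalad-next code:

```python
class DelimitedItemsBuffer:
    """Accumulate raw byte chunks and split them into complete,
    delimiter-terminated items -- the kind of reusable layer an
    intermediate 'zero-byte-delimited items' protocol could build on.
    Hypothetical sketch, not the datalad-next API.
    """
    def __init__(self, delimiter: bytes = b'\x00'):
        self.delimiter = delimiter
        self._pending = b''

    def feed(self, chunk: bytes) -> list:
        """Return the complete items contained in the data received so
        far; keep any trailing partial item buffered for the next call."""
        self._pending += chunk
        *items, self._pending = self._pending.split(self.delimiter)
        return items

    def flush(self) -> bytes:
        """Return any trailing data that was not delimiter-terminated."""
        pending, self._pending = self._pending, b''
        return pending
```

A concrete protocol such as `GitLsFilesProtocol` would then only parse complete items and no longer need to assume a particular `ls-files` command line.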

Contributor Author

@christian-monch christian-monch Oct 27, 2023


This raises interesting points for me:

  • yes, we would definitely use iter_gitworktree(), but iter_gitworktree() could be implemented this way for max alignment

That's what I thought as well. I was tempted to do that. But I refrained, in order to focus on the annex-batch-related parts.

  • GitLsFilesProtocol should probably not sit on StdOutCaptureGeneratorProtocol directly, but on an intermediate protocol that implements the zero-byte-delimited-items output, or even delimited-items output. Right now the protocol assumes that ls-files is called correctly, but in the line

      run(['git', 'ls-files', '-z', '--stage', '--cached'], protocol_class=GitLsFilesProtocol...
    

    I cannot immediately see this requirement/implication.

  • it seems we should/could aim for assembling a library of such common protocols

The now-closed big-runner PR with data_processors has a protocol, i.e. StdOutErrCaptureProcessingGeneratorProtocol, that passes all data through a data-processor pipeline. The pipeline is configured by the protocol keyword argument processors.

With this concept, the iter_gitworktree call would look something like this:

```
with run(
    ['git', 'ls-files', '-z', '--stage', '--cached'],
    protocol_class=StdOutCaptureProcessingGeneratorProtocol,
    protocol_kwargs={'processors': [decode_processor('utf-8'), splitlines_processor('\x00', False), gititem_processor]}
) as git_ls_files:
    pass
```

The gititem_processor does not yet exist, but it just converts a line of git ls-files output and probably some pending data into a GitWorkTreeItem.
The run-call could of course be encapsulated in a specialized function, reducing the call to something like this:

```
with git_ls_files(path=path, ...) as gls:
    ...
```
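
For illustration, a processor pipeline of this kind could be composed as in the following sketch. `decode_processor` and `splitlines_processor` are modeled on the names used above, but these are simplified stand-ins, not the big-runner PR's implementations:

```python
from functools import reduce


def decode_processor(encoding: str = 'utf-8'):
    """Return a processor that decodes byte items to str."""
    return lambda items: [i.decode(encoding) for i in items]


def splitlines_processor(sep: str = '\x00', keepends: bool = False):
    """Return a processor that splits each item on `sep`, dropping the
    empty trailing field produced by a trailing separator."""
    def split(items):
        out = []
        for item in items:
            fields = item.split(sep)
            if fields and fields[-1] == '':
                fields.pop()
            out.extend(f + sep if keepends else f for f in fields)
        return out
    return split


def run_pipeline(processors, data):
    """Push `data` through the processors in order."""
    return reduce(lambda items, proc: proc(items), processors, data)
```

A `gititem_processor` would then be one more stage at the end of the list, turning each decoded field group into a `GitWorkTreeItem`.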

@mih
Member

mih commented Oct 27, 2023

I left a first comment. I also tried to run it on my standard lots-of-files dataset. But it seems to hang there:

```
from datalad_next.iter_collections.annexworktree import iter_annexworktree
res = list(iter_annexworktree('.', untracked='all', link_target=False))
<Ctrl-C> after some time
```

```
File ~/env/datalad-dev/lib/python3.11/site-packages/datalad/runner/nonasyncrunner.py:637, in ThreadedRunner.process_queue(self)
    627 while True:
    628     # We do not need a user provided timeout here. If
    629     # self.timeout is None, no timeouts are reported anyway.
   (...)
    634     # `ThreadedRunner.process_check_interval`, to check whether the
    635     # process is still running.
    636     try:
--> 637         file_number, state, data = self.output_queue.get(
    638             timeout=ThreadedRunner.timeout_resolution)
    639         break
    640     except Empty:

File /usr/lib/python3.11/queue.py:180, in Queue.get(self, block, timeout)
    178         if remaining <= 0.0:
    179             raise Empty
--> 180         self.not_empty.wait(remaining)
    181 item = self._get()
    182 self.not_full.notify()

File /usr/lib/python3.11/threading.py:324, in Condition.wait(self, timeout)
    322 else:
    323     if timeout > 0:
--> 324         gotit = waiter.acquire(True, timeout)
    325     else:
    326         gotit = waiter.acquire(False)

KeyboardInterrupt: 
```

@christian-monch
Contributor Author

christian-monch commented Oct 27, 2023

I left a first comment. I also tried to run it on my standard lots-of-files dataset. But it seems to hang there:

from datalad_next.iter_collections.annexworktree import iter_annexworktree
res = list(iter_annexworktree('.', untracked='all', link_target=False))
<Ctrl-C> after some time
[...]
File /usr/lib/python3.11/threading.py:324, in Condition.wait(self, timeout)
    322 else:
    323     if timeout > 0:
--> 324         gotit = waiter.acquire(True, timeout)
    325     else:
    326         gotit = waiter.acquire(False)

KeyboardInterrupt: 

Thanks for the test. Weird, it seems it cannot acquire the lock for the queue...
Please note that link_target is ignored in the POC (due to the run-context-based iter_gitworkdir implementation).

Edit: I will take a look at that. Not sure what triggers it, but it might be a thread that crashed while holding a lock.

I think the problem might have been a bug in the batchcommand context-handler. If you actually got a queue.Empty exception, the "stdin closing" would not be executed. I updated the code; it executes on my example datasets.

You should nevertheless have gotten a timeout after ca. 4 seconds!? Weird! Which dataset do you use?

EDIT: maybe it just takes long? It took 3:10 minutes on my machine to iterate over ca. 33,000 files

@christian-monch
Copy link
Contributor Author

christian-monch commented Oct 27, 2023

I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine.

If you interrupt the execution with Ctrl-C, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.

@mih
Member

mih commented Oct 27, 2023

I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine.

If you interrupt the execution with Ctrl-C, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.

I did not wait for 3min, indeed. As you can see in dafafb3 I would expect a runtime of substantially below 10s.

@mih mih mentioned this pull request Oct 27, 2023
@christian-monch
Contributor Author

christian-monch commented Oct 27, 2023

I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine.
If you interrupt the execution with Ctrl-C, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.

I did not wait for 3min, indeed. As you can see in dafafb3 I would expect a runtime of substantially below 10s.

It takes between 2 and 3 seconds to do the git-ls-files iteration alone (with the batch processing commented out).

I did the following test:

```
> git ls-files >/tmp/glf-out
> time git annex lookupkey --batch < /tmp/glf-out >/tmp/alk-out

real    2m45,258s
user    1m16,367s
sys     1m22,510s
```

And this one:

```
> time git annex examinekey --batch --json < /tmp/alk-out >/tmp/aek-out

real    0m2,382s
user    0m2,111s
sys     0m0,261s
```

So, looking up a large number of keys is too slow. I will create a version that uses a run-context for git ls-files, a run-context for git annex find, and a batchcommand-context for git annex examinekey.

@christian-monch
Contributor Author

christian-monch commented Oct 27, 2023

I pushed 2913989. It uses two run-contexts to read from git ls-files and git annex find at the "same" time. It only uses a batchcommand-context for git annex examinekey. git annex lookupkey is gone.

This version tries to keep the maximum number of stored results low while having a single pass over both run-context iterators (the actual number is dependent on the ordering of names that are returned from the two run-contexts).

The overall time is around 14 seconds. It is more than 10 seconds, but a lot less than 3:15 minutes.

The code is not as simple anymore though :-/

It should be possible to do the same thing with iter_gitworktree instead of using a run-context. This commit focused on the two run-contexts and their "concurrent" use.
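
The "single pass over both run-context iterators" with a small result buffer could be sketched as follows. `join_by_name` is an illustrative stand-in for the commit's logic, assuming both iterators yield objects with a `name` attribute in roughly, but not exactly, the same order:

```python
def join_by_name(git_items, annex_items):
    """Single-pass join of two iterators whose items carry a `name`
    attribute. Yields (git_item, annex_item_or_None) pairs. Items seen
    on only one side so far are buffered until their counterpart
    arrives, keeping the number of stored results low when the
    orderings mostly agree. Sketch of the approach, not the PR's code.
    """
    pending_git = {}    # name -> git item waiting for an annex match
    pending_annex = {}  # name -> annex item waiting for a git match
    git_items, annex_items = iter(git_items), iter(annex_items)
    git_done = annex_done = False
    while not (git_done and annex_done):
        if not git_done:
            try:
                g = next(git_items)
                if g.name in pending_annex:
                    yield g, pending_annex.pop(g.name)
                else:
                    pending_git[g.name] = g
            except StopIteration:
                git_done = True
        if not annex_done:
            try:
                a = next(annex_items)
                if a.name in pending_git:
                    yield pending_git.pop(a.name), a
                else:
                    pending_annex[a.name] = a
            except StopIteration:
                annex_done = True
    # files tracked by git but not annexed have no annex counterpart
    for g in pending_git.values():
        yield g, None
```

The buffer size stays bounded by how far the two orderings drift apart, which matches the observation above that the stored-result count depends on the name ordering of the two run-contexts.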

@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch 2 times, most recently from 2719e2e to 0062a57 Compare October 28, 2023 19:16
@christian-monch christian-monch force-pushed the minimal-run-batch-context branch 2 times, most recently from f0f5f50 to 0c7da95 Compare November 6, 2023 20:02
Co-authored-by: Michael Hanke <[email protected]>

This commit adds a `batchcommand`-context-manager that
keeps track of a subprocess and supports batch-style
communication. That means it allows sending a request
to the subprocess and returns the response from the
subprocess. How the output of the subprocess has to be
interpreted to yield a response is defined by the
protocol that is provided to the
batchcommand-context-manager. The
batchcommand-context-manager uses the run-context-manager
to ensure that the exit-handler waits for process
termination.

The commit also contains three convenience functions
that provide context-managers for common interaction
patterns (common in `datalad` and `datalad_next` anyway):

1. `stdout_batchcommand`: this context-manager writes to
   `stdin` of the subprocess and returns the next, yet
   unread, chunk of bytes that was written to `stdout`
   of the subprocess.

2. `stdoutline_batchcommand`: this context-manager writes
   to `stdin` of the subprocess and returns the next,
   yet unread, line that was written to `stdout` of the
   subprocess. Line-separators and whether they are
   included in the output can be chosen when creating
   the context-manager.

3. `annexjson_batchcommand`: this context-manager writes
   to `stdin` of the subprocess and returns a JSON object
   that is created from the next, yet unread, line that
   is written to `stdout` of the subprocess. This
   context-manager is geared towards git-annex batch
   commands with JSON-result output, i.e. git-annex
   invocations with the parameters `--batch`, `--json`,
   `--json-error-messages`.
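
For illustration, the line-based variant (2.) can be approximated with plain `subprocess`. This is a simplified stand-in, not the `stdoutline_batchcommand` implementation; the example uses `cat` as a trivially echoing batch command:

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def stdoutline_batch(cmd, separator='\n', keep_separator=False):
    """Minimal stand-in for a line-based batch-command context-manager:
    send one request line to the subprocess's stdin, read back the next
    response line from its stdout. Sketch on top of plain `subprocess`,
    not the datalad-next implementation.
    """
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def query(line: str) -> str:
        proc.stdin.write(
            line if line.endswith(separator) else line + separator)
        proc.stdin.flush()
        response = proc.stdout.readline()
        return response if keep_separator else response.rstrip(separator)

    try:
        yield query
    finally:
        proc.stdin.close()   # closing stdin signals the batch process to exit ...
        proc.wait()          # ... then wait for its termination
```

The exit-handler mirrors the commit-message point that stdin closing and waiting for termination must happen on context exit.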

The tests provide an example of a protocol that handles
output that varies in size, i.e. the output of
Python programs. The class that implements the protocol
is called `PythonProtocol`.
@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch from 72628ee to 569e45a Compare November 6, 2023 20:23
Draft of `iter_annexworktree()`

Performance observations (annexrepo with 36k annexed files):

```
from datalad_next.iter_collections.annexworktree import iter_annexworktree

%timeit res = list(iter_annexworktree('.', untracked='all', link_target=False))
4.22 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res = list(iter_annexworktree('.', untracked=None, link_target=False))
4.33 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res = list(iter_annexworktree('.', untracked=None, link_target=True))
10.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes datalad#459
@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch from 569e45a to 1938fd0 Compare November 6, 2023 20:27
This commit adds tests that document the
difference in runtime between processing
n lines intertwined, i.e. write one line
and read one line, and processing n lines
in bulk, i.e. write n lines and read
n lines.
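
The intertwined-vs-bulk difference that commit measures can be sketched like this, again using `cat` as a stand-in batch command. This is a simplified illustration: a real bulk implementation would read and write concurrently so that large inputs cannot fill the pipe buffers and deadlock:

```python
import subprocess


def intertwined(proc, lines):
    """Write one line, read one line, n times -- one pipe round-trip
    (and scheduler wake-up) per item."""
    out = []
    for line in lines:
        proc.stdin.write(line + '\n')
        proc.stdin.flush()
        out.append(proc.stdout.readline().rstrip('\n'))
    return out


def bulk(proc, lines):
    """Write all n lines, then read all n responses -- data moves in
    large chunks, amortizing the per-item overhead. NOTE: assumes the
    data fits into the pipe buffers; otherwise the unread responses
    would block the writer."""
    proc.stdin.write('\n'.join(lines) + '\n')
    proc.stdin.flush()
    return [proc.stdout.readline().rstrip('\n') for _ in lines]
```

Timing both functions over the same subprocess and input reproduces the kind of runtime gap the tests document.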
@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch from 8da1f08 to 909a430 Compare November 13, 2023 17:43

codecov bot commented Nov 13, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (0c7da95) 19.73% compared to head (909a430) 92.41%.

Additional details and impacted files
@@                      Coverage Diff                       @@
##           minimal-run-batch-context     #520       +/-   ##
==============================================================
+ Coverage                      19.73%   92.41%   +72.67%     
==============================================================
  Files                            129      134        +5     
  Lines                           9175     9764      +589     
  Branches                        1054     1084       +30     
==============================================================
+ Hits                            1811     9023     +7212     
+ Misses                          7326      722     -6604     
+ Partials                          38       19       -19     
Files Coverage Δ
...t/iter_collections/tests/test_iterannexworktree.py 100.00% <100.00%> (ø)
datalad_next/runners/__init__.py 100.00% <ø> (ø)
datalad_next/runners/batch.py 100.00% <100.00%> (ø)
datalad_next/runners/tests/test_batch.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/annexworktree.py 98.03% <98.03%> (ø)
datalad_next/runners/protocols.py 94.28% <89.47%> (+44.28%) ⬆️
datalad_next/commands/ls_file_collection.py 93.04% <33.33%> (+53.62%) ⬆️

... and 112 files with indirect coverage changes


@mih
Member

mih commented Nov 23, 2023

With #538 merged, we are betting on a new horse for subprocess execution. I believe this PR can be closed and/or reposted in rejuvenated form.

@mih mih closed this Nov 23, 2023
@christian-monch christian-monch deleted the iter-annexworktree-run-and-batch branch July 16, 2024 10:12