
POC: implement iter_annexworktree with one run-context manager and two batch-context managers #520

Conversation

christian-monch
Contributor

@christian-monch christian-monch commented Oct 27, 2023

This PR is not meant for merging. It is a POC to demonstrate how iter_annexworktree can be implemented with run-context-managers and batch-context-managers in the following fashion:

```
with \
        run(['git', 'ls-files', ...], ...) as git_ls_files, \
        annexjson_batchcommand(['git', 'annex', 'examinekey', ...], ...) as examine_key, \
        annexline_batchcommand(['git', 'annex', 'lookupkey', ...], ...) as lookup_key:
    for item in git_ls_files:
        annex_key = lookup_key(item.name + '\n')
        if annex_key:
            key_properties = examine_key(annex_key)
            yield AnnexWorkTreeItem(item, key_properties)
        else:
            yield AnnexWorkTreeItem(item, None)
```

It also introduces a wrapper around the result iterators that triggers termination if an iteration step takes longer than a given timeout.
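
Such a timeout guard can be sketched with a producer thread feeding a queue, so the consumer can bound how long a single iteration step may take. The name `timeout_iter` and the `on_timeout` hook are hypothetical illustrations, not the PR's actual API:

```python
import threading
from queue import Queue, Empty
from typing import Callable, Iterable, Iterator, Optional


def timeout_iter(source: Iterable,
                 timeout: float,
                 on_timeout: Optional[Callable[[], None]] = None) -> Iterator:
    """Yield items from `source`, raising TimeoutError if a single
    iteration step takes longer than `timeout` seconds.

    `on_timeout` is an optional callable (e.g. a process terminator)
    that is invoked before the error is raised. Illustrative sketch,
    not the wrapper implemented in this PR.
    """
    queue: Queue = Queue()
    _END = object()

    def producer():
        try:
            for item in source:
                queue.put(('item', item))
        except Exception as e:
            queue.put(('error', e))
        queue.put(('item', _END))

    threading.Thread(target=producer, daemon=True).start()
    while True:
        try:
            kind, payload = queue.get(timeout=timeout)
        except Empty:
            if on_timeout:
                on_timeout()
            raise TimeoutError(f'iteration step exceeded {timeout}s')
        if kind == 'error':
            raise payload
        if payload is _END:
            return
        yield payload
```

A stalled subprocess iterator then surfaces as a `TimeoutError` in the consuming loop instead of hanging indefinitely.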

Note: the provided git-ls-files implementation is not as complete as the "original", since this is a POC. It might not work in some corner cases. So, for large-scale performance comparisons, a little more work most likely needs to be done.

@christian-monch christian-monch changed the title POC: implement iter annexworktree with one run-context manager and two batch-context managers POC: implement iter_annexworktree with one run-context manager and two batch-context managers Oct 27, 2023
Comment on lines 35 to 38
# The following protocol is part of the POC implementation. It allows a simple
# `with run()`-commandline that returns `GitWorkTreeItem`-objects or
# `GitWorkTreeFileSystemItem`-objects. In the real implementation we would
# probably use `iter_gitworktree()`.
Member


This raises interesting points for me:

  • yes, we would definitely use iter_gitworktree(), but iter_gitworktree() could be implemented this way for max alignment

  • GitLsFilesProtocol should probably not sit on StdOutCaptureGeneratorProtocol directly, but on an intermediate protocol that implements the zero-byte-delimited-items output, or even delimited-items output. Right now the protocol assumes that ls-files is called correctly, but in the line

      run(['git', 'ls-files', '-z', '--stage', '--cached'], protocol_class=GitLsFilesProtocol...
    

    I cannot immediately see this requirement/implication.

  • it seems we should/could aim for assembling a library of such common protocols
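
The "zero-byte-delimited items" layer proposed above could, for instance, be built on a small buffering helper along these lines. The class name and API are made up for illustration; this is not datalad-next code:

```python
class DelimitedItemsBuffer:
    """Accumulate raw byte chunks and split them into complete,
    delimiter-terminated items -- the kind of reusable layer an
    intermediate 'zero-byte-delimited items' protocol could build on.
    Hypothetical sketch, not the datalad-next API.
    """
    def __init__(self, delimiter: bytes = b'\x00'):
        self.delimiter = delimiter
        self._pending = b''

    def feed(self, chunk: bytes) -> list:
        """Return the complete items contained in the data received so
        far; keep any trailing partial item buffered for the next call."""
        self._pending += chunk
        *items, self._pending = self._pending.split(self.delimiter)
        return items

    def flush(self) -> bytes:
        """Return any trailing data that was not delimiter-terminated."""
        pending, self._pending = self._pending, b''
        return pending
```

A concrete protocol such as `GitLsFilesProtocol` would then only parse complete items and no longer need to assume a particular `ls-files` command line.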

Contributor Author

@christian-monch christian-monch Oct 27, 2023


This raises interesting points for me:

  • yes, we would definitely use iter_gitworktree(), but iter_gitworktree() could be implemented this way for max alignment

That's what I thought as well. I was tempted to do that. But I refrained, in order to focus on the annex-batch-related parts.

  • GitLsFilesProtocol should probably not sit on StdOutCaptureGeneratorProtocol directly, but on an intermediate protocol that implements the zero-byte-delimited-items output, or even delimited-items output. Right now the protocol assumes that ls-files is called correctly, but in the line

      run(['git', 'ls-files', '-z', '--stage', '--cached'], protocol_class=GitLsFilesProtocol...
    

    I cannot immediately see this requirement/implication.

  • it seems we should/could aim for assembling a library of such common protocols

The now-closed big-runner PR with data_processors has a protocol, i.e. StdOutErrCaptureProcessingGeneratorProtocol, that passes all data through a data-processor pipeline. The pipeline is configured by the protocol keyword argument processors.

With this concept, the iter_gitworktree call would look something like this:

```
with run(
    ['git', 'ls-files', '-z', '--stage', '--cached'],
    protocol_class=StdOutCaptureProcessingGeneratorProtocol,
    protocol_kwargs={'processors': [decode_processor('utf-8'), splitlines_processor('\x00', False), gititem_processor]}
) as git_ls_files:
    pass
```

The gititem_processor does not yet exist, but it just converts a line of git ls-files output and probably some pending data into a GitWorkTreeItem.
The run-call could of course be encapsulated in a specialized function, reducing the call to something like this:

```
with git_ls_files(path=path, ...) as gls:
    ...
```
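
For illustration, a processor pipeline of this kind could be composed as in the following sketch. `decode_processor` and `splitlines_processor` are modeled on the names used above, but these are simplified stand-ins, not the big-runner PR's implementations:

```python
from functools import reduce


def decode_processor(encoding: str = 'utf-8'):
    """Return a processor that decodes byte items to str."""
    return lambda items: [i.decode(encoding) for i in items]


def splitlines_processor(sep: str = '\x00', keepends: bool = False):
    """Return a processor that splits each item on `sep`, dropping the
    empty trailing field produced by a trailing separator."""
    def split(items):
        out = []
        for item in items:
            fields = item.split(sep)
            if fields and fields[-1] == '':
                fields.pop()
            out.extend(f + sep if keepends else f for f in fields)
        return out
    return split


def run_pipeline(processors, data):
    """Push `data` through the processors in order."""
    return reduce(lambda items, proc: proc(items), processors, data)
```

A `gititem_processor` would then be one more stage at the end of the list, turning each decoded field group into a `GitWorkTreeItem`.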

@mih
Member

mih commented Oct 27, 2023

I left a first comment. I also tried to run it on my standard lots-of-files dataset. But it seems to hang there:

```
from datalad_next.iter_collections.annexworktree import iter_annexworktree
res = list(iter_annexworktree('.', untracked='all', link_target=False))
<Ctrl-C> after some time
```

```
File ~/env/datalad-dev/lib/python3.11/site-packages/datalad/runner/nonasyncrunner.py:637, in ThreadedRunner.process_queue(self)
    627 while True:
    628     # We do not need a user provided timeout here. If
    629     # self.timeout is None, no timeouts are reported anyway.
   (...)
    634     # `ThreadedRunner.process_check_interval`, to check whether the
    635     # process is still running.
    636     try:
--> 637         file_number, state, data = self.output_queue.get(
    638             timeout=ThreadedRunner.timeout_resolution)
    639         break
    640     except Empty:

File /usr/lib/python3.11/queue.py:180, in Queue.get(self, block, timeout)
    178         if remaining <= 0.0:
    179             raise Empty
--> 180         self.not_empty.wait(remaining)
    181 item = self._get()
    182 self.not_full.notify()

File /usr/lib/python3.11/threading.py:324, in Condition.wait(self, timeout)
    322 else:
    323     if timeout > 0:
--> 324         gotit = waiter.acquire(True, timeout)
    325     else:
    326         gotit = waiter.acquire(False)

KeyboardInterrupt: 
```

@christian-monch
Contributor Author

christian-monch commented Oct 27, 2023

I left a first comment. I also tried to run it on my standard lots-of-files dataset. But it seems to hang there:

from datalad_next.iter_collections.annexworktree import iter_annexworktree
res = list(iter_annexworktree('.', untracked='all', link_target=False))
<Ctrl-C> after some time
[...]
File /usr/lib/python3.11/threading.py:324, in Condition.wait(self, timeout)
    322 else:
    323     if timeout > 0:
--> 324         gotit = waiter.acquire(True, timeout)
    325     else:
    326         gotit = waiter.acquire(False)

KeyboardInterrupt: 

Thanks for the test. Weird, it seems it cannot acquire the lock for the queue...
Please note that link_target is ignored in the POC (due to the run-context-based iter_gitworkdir implementation).

Edit: I will take a look at that. Not sure what triggers it, but it might be a thread that crashed while holding a lock.

I think the problem might have been a bug in the batchcommand context-handler. If you actually got a queue.Empty exception, the "stdin closing" would not be executed. I updated the code; it executes on my example datasets.

You should nevertheless have gotten a timeout after ca. 4 seconds!? Weird! Which dataset do you use?

EDIT: maybe it just takes long? It took 3:10 minutes on my machine to iterate over ca. 33,000 files

@christian-monch
Copy link
Contributor Author

christian-monch commented Oct 27, 2023

I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine.

If you interrupt the execution with Ctrl-C, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.

@mih
Member

mih commented Oct 27, 2023

I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine.

If you interrupt the execution with Ctrl-C, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.

I did not wait for 3min, indeed. As you can see in dafafb3 I would expect a runtime of substantially below 10s.

@mih mih mentioned this pull request Oct 27, 2023
@christian-monch
Contributor Author

christian-monch commented Oct 27, 2023

I used the iterator on a dataset with 33,000 entries and it took about 3:10 minutes on my development machine.
If you interrupt the execution with Ctrl-C, you are very likely to end up in the queue-lock acquire code, so that has no specific significance.

I did not wait for 3min, indeed. As you can see in dafafb3 I would expect a runtime of substantially below 10s.

It takes between 2 and 3 seconds to do the git-ls-files iteration alone (with the batch processing commented out).

I did the following test:

```
> git ls-files >/tmp/glf-out
> time git annex lookupkey --batch < /tmp/glf-out >/tmp/alk-out

real    2m45,258s
user    1m16,367s
sys     1m22,510s
```

And this one:

```
> time git annex examinekey --batch --json < /tmp/alk-out >/tmp/aek-out

real    0m2,382s
user    0m2,111s
sys     0m0,261s
```

So, looking up a large number of keys is too slow. I will create a version that uses a run-context for git ls-files, a run-context for git annex find, and a batchcommand-context for git annex examinekey.

@christian-monch
Contributor Author

christian-monch commented Oct 27, 2023

I pushed 2913989. It uses two run-contexts to read from git ls-files and git annex find at the "same" time. It only uses a batchcommand-context for git annex examinekey. git annex lookupkey is gone.

This version tries to keep the maximum number of stored results low while having a single pass over both run-context iterators (the actual number is dependent on the ordering of names that are returned from the two run-contexts).

The overall time is around 14 seconds. It is more than 10 seconds, but a lot less than 3:15 minutes.

The code is not as simple anymore though :-/

It should be possible to do the same thing with iter_gitworktree instead of using a run-context. This commit focused on the two run-contexts and their "concurrent" use.
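
The "single pass over both run-context iterators" with a small result buffer could be sketched as follows. `join_by_name` is an illustrative stand-in for the commit's logic, assuming both iterators yield objects with a `name` attribute in roughly, but not exactly, the same order:

```python
def join_by_name(git_items, annex_items):
    """Single-pass join of two iterators whose items carry a `name`
    attribute. Yields (git_item, annex_item_or_None) pairs. Items seen
    on only one side so far are buffered until their counterpart
    arrives, keeping the number of stored results low when the
    orderings mostly agree. Sketch of the approach, not the PR's code.
    """
    pending_git = {}    # name -> git item waiting for an annex match
    pending_annex = {}  # name -> annex item waiting for a git match
    git_items, annex_items = iter(git_items), iter(annex_items)
    git_done = annex_done = False
    while not (git_done and annex_done):
        if not git_done:
            try:
                g = next(git_items)
                if g.name in pending_annex:
                    yield g, pending_annex.pop(g.name)
                else:
                    pending_git[g.name] = g
            except StopIteration:
                git_done = True
        if not annex_done:
            try:
                a = next(annex_items)
                if a.name in pending_git:
                    yield pending_git.pop(a.name), a
                else:
                    pending_annex[a.name] = a
            except StopIteration:
                annex_done = True
    # files tracked by git but not annexed have no annex counterpart
    for g in pending_git.values():
        yield g, None
```

The buffer size stays bounded by how far the two orderings drift apart, which matches the observation above that the stored-result count depends on the name ordering of the two run-contexts.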

@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch 2 times, most recently from 2719e2e to 0062a57 Compare October 28, 2023 19:16
@christian-monch christian-monch force-pushed the minimal-run-batch-context branch 2 times, most recently from f0f5f50 to 0c7da95 Compare November 6, 2023 20:02
Co-authored-by: Michael Hanke <[email protected]>

This commit adds a `batchcommand`-context-manager that
keeps track of a subprocess and supports batch-style
communication. That means it allows sending a request
to the subprocess and returns the response from the
subprocess. How the output of the subprocess has to be
interpreted to yield a response is defined by the
protocol that is provided to the
batchcommand-context-manager. The
batchcommand-context-manager uses the run-context-manager
to ensure that the exit-handler waits for process
termination.

The commit also contains three convenience functions
that provide context-managers for common interaction
patterns (common in `datalad` and `datalad_next` anyway):

1. `stdout_batchcommand`: this context-manager writes to
   `stdin` of the subprocess and returns the next, yet
   unread, chunk of bytes that was written to `stdout`
   of the subprocess.

2. `stdoutline_batchcommand`: this context-manager writes
   to `stdin` of the subprocess and returns the next,
   yet unread, line that was written to `stdout` of the
   subprocess. Line-separators and whether they are
   included in the output can be chosen when creating
   the context-manager.

3. `annexjson_batchcommand`: this context-manager writes
   to `stdin` of the subprocess and returns a JSON object
   that is created from the next, yet unread, line that
   is written to `stdout` of the subprocess. This
   context-manager is geared towards git-annex batch
   commands with JSON-result output, i.e. git-annex
   invocations with the parameters `--batch`, `--json`,
   `--json-error-messages`.
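
For illustration, the line-based variant (2.) can be approximated with plain `subprocess`. This is a simplified stand-in, not the `stdoutline_batchcommand` implementation; the example uses `cat` as a trivially echoing batch command:

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def stdoutline_batch(cmd, separator='\n', keep_separator=False):
    """Minimal stand-in for a line-based batch-command context-manager:
    send one request line to the subprocess's stdin, read back the next
    response line from its stdout. Sketch on top of plain `subprocess`,
    not the datalad-next implementation.
    """
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def query(line: str) -> str:
        proc.stdin.write(
            line if line.endswith(separator) else line + separator)
        proc.stdin.flush()
        response = proc.stdout.readline()
        return response if keep_separator else response.rstrip(separator)

    try:
        yield query
    finally:
        proc.stdin.close()   # closing stdin signals the batch process to exit ...
        proc.wait()          # ... then wait for its termination
```

The exit-handler mirrors the commit-message point that stdin closing and waiting for termination must happen on context exit.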

The tests provide an example of a protocol that handles
output that varies in size, i.e. the output of
Python programs. The class that implements the protocol
is called `PythonProtocol`.
@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch from 72628ee to 569e45a Compare November 6, 2023 20:23
Draft of `iter_annexworktree()`

Performance observations (annexrepo with 36k annexed files):

```
from datalad_next.iter_collections.annexworktree import iter_annexworktree

%timeit res = list(iter_annexworktree('.', untracked='all', link_target=False))
4.22 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res = list(iter_annexworktree('.', untracked=None, link_target=False))
4.33 s ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit res = list(iter_annexworktree('.', untracked=None, link_target=True))
10.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes datalad#459
@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch from 569e45a to 1938fd0 Compare November 6, 2023 20:27
This commit adds tests that document the
difference in runtime between processing
n lines intertwined, i.e. write one line
and read one line, and processing n lines
in bulk, i.e. write n lines and read
n lines.
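
The intertwined-vs-bulk difference that commit measures can be sketched like this, again using `cat` as a stand-in batch command. This is a simplified illustration: a real bulk implementation would read and write concurrently so that large inputs cannot fill the pipe buffers and deadlock:

```python
import subprocess


def intertwined(proc, lines):
    """Write one line, read one line, n times -- one pipe round-trip
    (and scheduler wake-up) per item."""
    out = []
    for line in lines:
        proc.stdin.write(line + '\n')
        proc.stdin.flush()
        out.append(proc.stdout.readline().rstrip('\n'))
    return out


def bulk(proc, lines):
    """Write all n lines, then read all n responses -- data moves in
    large chunks, amortizing the per-item overhead. NOTE: assumes the
    data fits into the pipe buffers; otherwise the unread responses
    would block the writer."""
    proc.stdin.write('\n'.join(lines) + '\n')
    proc.stdin.flush()
    return [proc.stdout.readline().rstrip('\n') for _ in lines]
```

Timing both functions over the same subprocess and input reproduces the kind of runtime gap the tests document.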
@christian-monch christian-monch force-pushed the iter-annexworktree-run-and-batch branch from 8da1f08 to 909a430 Compare November 13, 2023 17:43

codecov bot commented Nov 13, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (0c7da95) 19.73% compared to head (909a430) 92.41%.

Additional details and impacted files
@@                      Coverage Diff                       @@
##           minimal-run-batch-context     #520       +/-   ##
==============================================================
+ Coverage                      19.73%   92.41%   +72.67%     
==============================================================
  Files                            129      134        +5     
  Lines                           9175     9764      +589     
  Branches                        1054     1084       +30     
==============================================================
+ Hits                            1811     9023     +7212     
+ Misses                          7326      722     -6604     
+ Partials                          38       19       -19     
Files Coverage Δ
...t/iter_collections/tests/test_iterannexworktree.py 100.00% <100.00%> (ø)
datalad_next/runners/__init__.py 100.00% <ø> (ø)
datalad_next/runners/batch.py 100.00% <100.00%> (ø)
datalad_next/runners/tests/test_batch.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/annexworktree.py 98.03% <98.03%> (ø)
datalad_next/runners/protocols.py 94.28% <89.47%> (+44.28%) ⬆️
datalad_next/commands/ls_file_collection.py 93.04% <33.33%> (+53.62%) ⬆️

... and 112 files with indirect coverage changes


@mih
Member

mih commented Nov 23, 2023

With #538 merged, we are betting on a new horse for subprocess execution. I believe this PR can be closed and/or reposted in rejuvenated form.

@mih mih closed this Nov 23, 2023
@christian-monch christian-monch deleted the iter-annexworktree-run-and-batch branch July 16, 2024 10:12