Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use iter_subproc in datalad_next.url_operations.ssh instead of datalad-core runner code #546

Closed

Conversation

christian-monch
Copy link
Contributor

@christian-monch christian-monch commented Nov 24, 2023

This PR uses iter_subproc instead of the datalad-core based run context-manager to implement SshUrlOperations.

The code changes are based on iter-annexworktree-subproc-branch (PR #539) and contain only one commit beyond the branch iter-annexworktree-subproc: 64e82d7. This branch should be rebased on main once PR #539 is merged.

TODO:

  • Run a benchmark

@christian-monch christian-monch requested a review from mih as a code owner November 24, 2023 13:28
Copy link

codecov bot commented Nov 24, 2023

Codecov Report

Attention: 53 lines in your changes are missing coverage. Please review.

Comparison is base (26e6cb4) 92.87% compared to head (895a2b5) 90.97%.
Report is 10 commits behind head on main.

Files Patch % Lines
datalad_next/url_operations/ssh.py 25.37% 47 Missing and 3 partials ⚠️
datalad_next/itertools/align_pattern.py 88.88% 1 Missing and 1 partial ⚠️
datalad_next/url_operations/__init__.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #546      +/-   ##
==========================================
- Coverage   92.87%   90.97%   -1.91%     
==========================================
  Files         143      144       +1     
  Lines       10365    10075     -290     
  Branches     1141     1144       +3     
==========================================
- Hits         9627     9166     -461     
- Misses        714      726      +12     
- Partials       24      183     +159     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@christian-monch christian-monch force-pushed the iterable-subprocess-sshurl branch 2 times, most recently from 3536260 to 8687f42 Compare November 27, 2023 04:09
@christian-monch christian-monch force-pushed the iterable-subprocess-sshurl branch 5 times, most recently from 2c74d00 to 4b4f8af Compare November 29, 2023 06:17
@mih
Copy link
Member

mih commented Dec 6, 2023

I created a rebased version of this PR that sits on the recent state of #539

I would push that here, once that PR is merged.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice! It is great to see that the new paradigm is smoothly applicable to this use case too!

I made a few comments. Some low hanging fruit, only one actual bug (progress end reporting).

christian-monch and others added 5 commits December 6, 2023 14:21
This commit adds the `align_pattern`-iterator
including tests.

The `align_pattern`-iterator is used in
the `iter_subproc`-based implementation
of `SshUrlOperations`.
This commit replaces all uses of
`datalad.runner.nonasyncrunner` in
`datalad_next.url_operations.ssh`
with `datalad_next.runners.iter_subproc`

The commit also adapts the tests.
Make key exception type available from the same location.
Unclear why useful
Rather than feeding everything through the pattern aligner.
There seems to be no gain, and it can be easily done, it appears.
@mih mih force-pushed the iterable-subprocess-sshurl branch from 4b4f8af to 9f4d49d Compare December 6, 2023 13:24
@mih
Copy link
Member

mih commented Dec 6, 2023

The codecov patch report seems bogus. Here is what I get locally:

DATALAD_TESTS_SSH=1 python -m pytest --cov datalad_next.url_operations.ssh -s -v datalad_next/url_operations/tests/test_ssh.py
...
---------- coverage: platform linux, python 3.11.2-final-0 -----------
Name                                 Stmts   Miss Branch BrPart  Cover   Missing
--------------------------------------------------------------------------------
datalad_next/url_operations/ssh.py     106      7     34      4    89%   75-79, 100, 207, 315, 317->321
--------------------------------------------------------------------------------
TOTAL                                  106      7     34      4    89%

Not stellar, but not bad either.

@mih
Copy link
Member

mih commented Dec 6, 2023

Benchmark:

I downloaded a 230M file over an SSH connection. I see

  • ~50+ seconds (4.5MB/s) for scp
  • ~45 seconds (5.5MB/s) for datalad download

I checked that the md5sums for all downloads are the same, because it could not easily believe it.

Given only N=4, I would not call it "faster", but certainly also not slower. Well done!

It is unclear to me why it was removed.
The code that closed `dst_fp` was removed by
mistake. This commit brings it back.

The commit addresses reviewer comment:
datalad#546 (comment)
This commit moves the calls to
`SshUrlOperations._progress_report_stop` into
the `after`-function of `side_effect`.

This moves the progress logic into a "single"
place, which makes it easier to factor a
progress reporting iterator out in the next
commit.
This commit factors out the use of `side_effect`
in `SshUrlOperation` to implement progress
reporting.

It creates the helper-function
`UrlOperations._reporting` the takes an
`Iterable` and yields the items in the
iterable while calling progress report
methods for each item. It also starts
and ends the progress reporting.
This commit updates the docstring of `align_pattern`
to document that `align_pattern` can be "removed"
from a stream after it yields an item.

This is used in `SshUrlOperations._get_props`
to return a "clean" stream-iterator, i.e. a stream
iterator that does not still execute the
`align_pattern`-code.

(The code was actually already doing the right
thing, but comments were still suggesting, that
`align_pattern` stays in the loop.)
@christian-monch christian-monch force-pushed the iterable-subprocess-sshurl branch from 94860b2 to ce8ebc0 Compare December 6, 2023 17:39
@christian-monch christian-monch force-pushed the iterable-subprocess-sshurl branch from ce8ebc0 to febe227 Compare December 6, 2023 17:42
This commit adds a simple fix for the test
`test_exception_from_return_code` to allow
it to pass if the command output is localized
for `de_DE`.
This commit catches `StopIteration`-excptions
that are raised in an `iter_subproc`-context
to ensure that `IterableSubprocessError`
generation is performed (if the subprocess
returned non-zero).

Without catching `StopIteration`-exceptions
within the subproc-context, a `next()` on an
exhausted iterator would trigger a code-path
that does not check the
return code of the subprocess. This path is
actually triggered by almost all exceptions,
but we generate `StopIteration` as part of
our error-free program flow, so we have to
deal with it at some point.

There are a few possible solutions here:

1. Do what this commit does. Don't let
   expected exceptions leave the
   subproc-context.

2. Modify the code in such a way, that
   it does not use `next` on iterators.

3. Ensure that `iterable_subprocess`
   executes the check for
   `IterableSubprocessError` in the
   presence of an exception from the
   context. This could easily be done
   by putting the check into a
   `finally`-branch. But that would
   "hide" a code exception, if the
   subprocess does exit with non-zero.
   (And would not pass a lot of tests)
Comment on lines +97 to +105
# any exception that is raised in this context and not caught
# will prevent the creation of `IterableSubprocessError`. But
# we rely on the return code of the ssh-process to signal
# specific errors. Therefore, we catch the expected
# `StopIteration` here.
try:
props = self._get_props(url, stream)
except StopIteration:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand why this is necessary and also not when taken the comment into account. The for loop should catch StopIteration

In [1]: try:
   ...:     for i in [1,2,3]:
   ...:         print(i)
   ...: except Exception:
   ...:     print('bummer')
   ...: 
1
2
3

And no other exception should be shadowed by StopIteration

In [5]: try:
   ...:     for i in [1,2,3]:
   ...:         print(i)
   ...:         raise RuntimeError
   ...: except StopIteration:
   ...:     print('bummer')
   ...: 
1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 4
      2     for i in [1,2,3]:
      3         print(i)
----> 4         raise RuntimeError
      5 except StopIteration:
      6     print('bummer')

RuntimeError: 

Please help!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example without an exception in the iter_subproc-context:

>>> from datalad_next.runners.iter_subproc import iter_subproc
>>> with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
...     tuple(sub)
... 
()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/cristian/Develop/datalad-next/datalad_next/iterable_subprocess/iterable_subprocess.py", line 153, in iterable_subprocess
    raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])
datalad_next.iterable_subprocess.iterable_subprocess.IterableSubprocessError: (2, b"ls: Zugriff auf '/THIS_DOES_NOT_EXIST' nicht m\xc3\xb6glich: Datei oder Verzeichnis nicht gefunden\n")

This raises IterableSubprocessError.

The following example raises StopIteration:

>>> from datalad_next.runners.iter_subproc import iter_subproc
>>> with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
...     next(sub)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
StopIteration

Because we want to get the return code of our subprocesses, we need to catch StopIteration (in this case it is StopIteration, but generally all exceptions that are raised in the iter_subproc-context need to be caught to get to an IterableSubprocessError).

We could of course write code that does not raise StopIteration, but that requires to pass error values around, which is not too nice either.

Copy link
Member

@mih mih Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how do the examples relate to the usage pattern in this PR? The iteration is done within the for-loop. How would StopIteration get out? My last example raises inside the for-loop, and we see RuntimeError, the actual exception.

If I understand you correctly, IterableSubprocessError would never come out, unless all other exceptions are caught. But I do not see that (and it would be a bit weird):

In [1]: from datalad_next.runners import iter_subproc
In [2]: with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
   ...:     for i in sub:
   ...:         print('some')
   ...: 
---------------------------------------------------------------------------
IterableSubprocessError                   Traceback (most recent call last)
Cell In[2], line 1
----> 1 with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
      2     for i in sub:
      3         print('some')

File /usr/lib/python3.11/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    142 if typ is None:
    143     try:
--> 144         next(self.gen)
    145     except StopIteration:
    146         return False

File ~/hacking/datalad/next/datalad_next/iterable_subprocess/iterable_subprocess.py:153, in iterable_subprocess(program, input_chunks, chunk_size)
    150         raise e.__context__ from None
    152 if proc.returncode:
--> 153     raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])

IterableSubprocessError: (2, b"ls: cannot access '/THIS_DOES_NOT_EXIST': No such file or directory\n")

Take a look at the traceback ^^^

Copy link
Contributor Author

@christian-monch christian-monch Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how do the examples relate to the usage pattern in this PR? The iteration is done within the for-loop. How would StopIteration get out? My last example raises inside the for-loop, and we see RuntimeError, the actual exception.
[...]

We also use next() in self._get_props(). We can use a for-loop for those as well, but then we have to pass an error-result, e.g. None to the caller and handle that in the the caller.

If I understand you correctly, IterableSubprocessError would never come out, unless all other exceptions are caught. [,,,]

Yes! IterableSubprocessError will never come out, if any exception is raised in the iter_subproc-context (there is one exception from that rule, the internal _BrokenPipeError). E.g:

>>> from datalad_next.runners.iter_subproc import iter_subproc
>>> with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
...     raise(ValueError)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError

So we have to prevent exception-raising, if we want to get access to the return-code of the subprocess.

[...] But I do not see that (and it would be a bit weird):

In [1]: from datalad_next.runners import iter_subproc
In [2]: with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
   ...:     for i in sub:
   ...:         print('some')
   ...: 
---------------------------------------------------------------------------
IterableSubprocessError                   Traceback (most recent call last)
Cell In[2], line 1
----> 1 with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
      2     for i in sub:
      3         print('some')

File /usr/lib/python3.11/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    142 if typ is None:
    143     try:
--> 144         next(self.gen)
    145     except StopIteration:
    146         return False

File ~/hacking/datalad/next/datalad_next/iterable_subprocess/iterable_subprocess.py:153, in iterable_subprocess(program, input_chunks, chunk_size)
    150         raise e.__context__ from None
    152 if proc.returncode:
--> 153     raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])

IterableSubprocessError: (2, b"ls: cannot access '/THIS_DOES_NOT_EXIST': No such file or directory\n")

Take a look at the traceback ^^^

IIANM, the traceback does not involve an exception raised in the context (the value of typ in _GeneratorContextManager.__exit__ is None), therefore lines 152 - 153 of iterable_subprocess.py are executed.

But this is not the scenario that a try: ... except StopIteration: is protecting against. The try: ... except StopIteration: prevents that an exception that is raised in the context, i.e. in the body of with iter_subproc() as x:, does not leave the context.

Raising any exception in the body of with iter_subproc() as x: would trigger execution of lines 138 - 139 of iterable_subprocess.py:

   137              except BaseException:
   138                  proc.terminate()
   139                  raise

And lines 152 - 153 of iterable_subprocess.py are never executed, hence IterableSubprocessError is never raised.

Edit: as a note: I think the behavior of iterable_subprocess is sensible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I am slowly getting what you are saying. I believe the key is:

We also use next() in self._get_props().

But to me all the wrapping in try/except seems to be solely, because that is the case, and not "also".

If that is true, it would seem to be more sensible to prevent _get_props() from ever raising StopIteration. It is an internal function that always needs this protection, and that need is neither obvious, nor documented.

But that seems to require a rather complex logic, because, again if I understand this right, we would want to leave the iter_subproc context immediately, and have an outer shell catch the then generated IterableSubprocessError -- which is only generated, if there is no exception already.

If that is all correct, we need to discuss two things:

  • this should be documented in iter_subproc() -- it is all but obvious
  • can this be simplified (avoid having to wrap any and all context body like this) by also handling StopIteration where _BrokenPipeError is handled in iterable_subprocess()? Basically:
    except _BrokenPipeError as e:
        if proc.returncode == 0:
            raise e.__context__ from None
    # add this ----------------
    except StopIteration:
       # something try to read output, but there is none
       # if that is a problem the next check will tell
    # until here ---------------

    if proc.returncode:
        raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])

Copy link
Contributor Author

@christian-monch christian-monch Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I am slowly getting what you are saying. I believe the key is:
[...]

We also use next() in self._get_props().

But to me all the wrapping in try/except seems to be solely, because that is the case, and not "also".

That is exactly right, you said it much better than me (any missunderstandings here are a result of my English communication skills: the "also" referred to fact that we use next() in addition to for-loops).

If that is true, it would seem to be more sensible to prevent _get_props() from ever raising StopIteration. It is an internal function that always needs this protection, and that need is neither obvious, nor documented.

Documentation is a good idea. It should probably be done somewhere in iter_subproc because any exception that is raised within the body of an iter_subproc-context would prevent the execution of the code-path that creates IterableSubprocessError. And we might need the return code in many places.

[ it would seem to be more sensible to prevent _get_props() from ever raising StopIteration] But that seems to require a rather complex logic, because, again if I understand this right, we would want to leave the iter_subproc context immediately, and have an outer shell catch the then generated IterableSubprocessError -- which is only generated, if there is no exception already.

That is correct. I also think that catching StopIteration in _get_props() and signaling an error in our code, e.g. by returning None would complicate the logic of the callers of _get_props(), it would make the code harder to read, and it does not seem very "pythonic".

If that is all correct, we need to discuss two things:

  • this should be documented in iter_subproc() -- it is all but obvious

Very good idea (I like the location ;-) had a similar thought above)

  • can this be simplified (avoid having to wrap any and all context body like this) by also handling StopIteration where _BrokenPipeError is handled in iterable_subprocess()? Basically:
    except _BrokenPipeError as e:
        if proc.returncode == 0:
            raise e.__context__ from None
    # add this ----------------
    except StopIteration:
       # something try to read output, but there is none
       # if that is a problem the next check will tell
    # until here ---------------

    if proc.returncode:
        raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])

That is a very good idea. I was not sure, if we want to change iterable_subprocess so drastically, and how this would affect its correctness. But I think it is probably the simplest solution. (I did some experiments on that earlier because this is one of the options mentioned in the commit message of 210d947. That was also the motivation for looking into the tests).

I think something like this would work:

    except (_BrokenPipeError, StopIterations) as e:
        # This except-clause prioritizes `IterableSubprocessError` over
        # `StopIteration` and `_BrokenPipeError` to avoid swallowing the former.
        # This assumes that any `StopIteration` that is
        # caught here, and that coincides with a non-zero return code,
        # is raised due to an unexpected short read from `stdout` of
        # the subprocess and directly connected to the non-zero
        # return code and can therefore be ignored, because the
        # return code would transport all necessary information.
        if proc.returncode == 0:
            raise e.__context__ from None

If we want to do this, I would suggest making the change to iterable_subprocesses in an extra PR, which includes docs and tests. Once this is merged, we simplify UrlSshOperations by removing the unnecessary try-except StopIteration-statements.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a good approach (dedicated PR first, and this one waiting).

I will keep trying to think of a reason why we would not want to change iterable_subprocesses. Obviously, fewer changes are better, but this issues seems to be impossible to fix from the outside.

Conceptually, _BrokenPipeError and StopIteration appear to be to variants of the same problem: premature death. It continues to appear sensible to me to also deal with them in the same way.

You patch is a minimal as it can get. What I cannot anticipate are sideeffect of raising the StopIteration.__context__. Never seen that being done.

Copy link
Contributor Author

@christian-monch christian-monch Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] You[r] patch is a minimal as it can get. What I cannot anticipate are sideeffect of raising the StopIteration.context. Never seen that being done.

That is a good point because we should not do that. So the patch will have to be a little more intrusive, something like:

    except _BrokenPipeError as e:
        if proc.returncode == 0:
            raise e.__context__ from None
    except StopIteration:
        if proc.returncode == 0:
            raise

[...]
Conceptually, _BrokenPipeError and StopIteration appear to be to variants of the same problem: premature death. It continues to appear sensible to me to also deal with them in the same way.
[...]

That is a nice thought. I was struggeling to find sound reasoning for the patch.

Preparing the PR now.

Copy link
Contributor Author

@christian-monch christian-monch Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...]
Preparing the PR now.

I had another idea on how to make return codes accessible in the presence of exceptions, which is a lot more pythonic and cleaner FMPOV: PR #565.

(I had this idea after opening PR #562, which implements what is outlined above. So I put PR #562 into draft-mode)

christian-monch and others added 2 commits December 7, 2023 17:41
This commit finalizes the changes in the previous
commit, which renamed the usage of
`UrlOperations._reporting` to `UrlOperations._with_progress`.
This commit performs the missing implementation renaming.
@christian-monch christian-monch force-pushed the iterable-subprocess-sshurl branch from c68b98a to 895a2b5 Compare December 7, 2023 22:42
christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Dec 8, 2023
This commit modifies `iterable_subprocess` so that
it prioritizes `IterableSubprocessError` exceptions
over `StopIteration` exceptions, if a `StopIteration`
exception is raised in the body of an
`iterable_subprocess`-context and if the sub process
exits with a non-zero return code.

`StopIteration` in conjunction with a non-zero
return code indicates an unexpected short read
due to an early "error"-exit of the sub process.

This commit is a result of this discussion:
datalad#546 (comment)
@mih
Copy link
Member

mih commented Dec 13, 2023

Yhus PR is replaced by #566

@mih mih closed this Dec 13, 2023
@christian-monch christian-monch deleted the iterable-subprocess-sshurl branch July 16, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants