-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use iter_subproc
in datalad_next.url_operations.ssh
instead of datalad-core runner code
#546
Use iter_subproc
in datalad_next.url_operations.ssh
instead of datalad-core runner code
#546
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #546 +/- ##
==========================================
- Coverage 92.87% 90.97% -1.91%
==========================================
Files 143 144 +1
Lines 10365 10075 -290
Branches 1141 1144 +3
==========================================
- Hits 9627 9166 -461
- Misses 714 726 +12
- Partials 24 183 +159 ☔ View full report in Codecov by Sentry. |
3536260
to
8687f42
Compare
2c74d00
to
4b4f8af
Compare
I created a rebased version of this PR that sits on the recent state of #539 I would push that here, once that PR is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nice! It is great to see that the new paradigm is smoothly applicable to this use case too!
I made a few comments. Some low hanging fruit, only one actual bug (progress end reporting).
This commit adds the `align_pattern`-iterator including tests. The `align_pattern`-iterator is used in the `iter_subproc`-based implementation of `SshUrlOperations`.
This commit replaces all uses of `datalad.runner.nonasyncrunner` in `datalad_next.url_operations.ssh` with `datalad_next.runners.iter_subproc` The commit also adapts the tests.
Make key exception type available from the same location.
Unclear why useful
Rather than feeding everything through the pattern aligner. There seems to be no gain, and it can be easily done, it appears.
4b4f8af
to
9f4d49d
Compare
The codecov patch report seems bogus. Here is what I get locally:
Not stellar, but not bad either. |
Benchmark: I downloaded a 230M file over an SSH connection. I see
I checked that the md5sums for all downloads are the same, because it could not easily believe it. Given only N=4, I would not call it "faster", but certainly also not slower. Well done! |
It is unclear to me why it was removed.
The code that closed `dst_fp` was removed by mistake. This commit brings it back. The commit addresses reviewer comment: datalad#546 (comment)
This commit moves the calls to `SshUrlOperations._progress_report_stop` into the `after`-function of `side_effect`. This moves the progress logic into a "single" place, which makes it easier to factor a progress reporting iterator out in the next commit.
This commit factors out the use of `side_effect` in `SshUrlOperation` to implement progress reporting. It creates the helper-function `UrlOperations._reporting` the takes an `Iterable` and yields the items in the iterable while calling progress report methods for each item. It also starts and ends the progress reporting.
This commit updates the docstring of `align_pattern` to document that `align_pattern` can be "removed" from a stream after it yields an item. This is used in `SshUrlOperations._get_props` to return a "clean" stream-iterator, i.e. a stream iterator that does not still execute the `align_pattern`-code. (The code was actually already doing the right thing, but comments were still suggesting, that `align_pattern` stays in the loop.)
94860b2
to
ce8ebc0
Compare
ce8ebc0
to
febe227
Compare
This commit adds a simple fix for the test `test_exception_from_return_code` to allow it to pass if the command output is localized for `de_DE`.
This commit catches `StopIteration`-excptions that are raised in an `iter_subproc`-context to ensure that `IterableSubprocessError` generation is performed (if the subprocess returned non-zero). Without catching `StopIteration`-exceptions within the subproc-context, a `next()` on an exhausted iterator would trigger a code-path that does not check the return code of the subprocess. This path is actually triggered by almost all exceptions, but we generate `StopIteration` as part of our error-free program flow, so we have to deal with it at some point. There are a few possible solutions here: 1. Do what this commit does. Don't let expected exceptions leave the subproc-context. 2. Modify the code in such a way, that it does not use `next` on iterators. 3. Ensure that `iterable_subprocess` executes the check for `IterableSubprocessError` in the presence of an exception from the context. This could easily be done by putting the check into a `finally`-branch. But that would "hide" a code exception, if the subprocess does exit with non-zero. (And would not pass a lot of tests)
This reverts commit b96efdc.
# any exception that is raised in this context and not caught | ||
# will prevent the creation of `IterableSubprocessError`. But | ||
# we rely on the return code of the ssh-process to signal | ||
# specific errors. Therefore, we catch the expected | ||
# `StopIteration` here. | ||
try: | ||
props = self._get_props(url, stream) | ||
except StopIteration: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont understand why this is necessary and also not when taken the comment into account. The for
loop should catch StopIteration
In [1]: try:
...: for i in [1,2,3]:
...: print(i)
...: except Exception:
...: print('bummer')
...:
1
2
3
And no other exception should be shadowed by StopIteration
In [5]: try:
...: for i in [1,2,3]:
...: print(i)
...: raise RuntimeError
...: except StopIteration:
...: print('bummer')
...:
1
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[5], line 4
2 for i in [1,2,3]:
3 print(i)
----> 4 raise RuntimeError
5 except StopIteration:
6 print('bummer')
RuntimeError:
Please help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example without an exception in the iter_subproc
-context:
>>> from datalad_next.runners.iter_subproc import iter_subproc
>>> with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
... tuple(sub)
...
()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/home/cristian/Develop/datalad-next/datalad_next/iterable_subprocess/iterable_subprocess.py", line 153, in iterable_subprocess
raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])
datalad_next.iterable_subprocess.iterable_subprocess.IterableSubprocessError: (2, b"ls: Zugriff auf '/THIS_DOES_NOT_EXIST' nicht m\xc3\xb6glich: Datei oder Verzeichnis nicht gefunden\n")
This raises IterableSubprocessError
.
The following example raises StopIteration
:
>>> from datalad_next.runners.iter_subproc import iter_subproc
>>> with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
... next(sub)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
StopIteration
Because we want to get the return code of our subprocesses, we need to catch StopIteration
(in this case it is StopIteration
, but generally all exceptions that are raised in the iter_subproc
-context need to be caught to get to an IterableSubprocessError
).
We could of course write code that does not raise StopIteration
, but that requires to pass error values around, which is not too nice either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how do the examples relate to the usage pattern in this PR? The iteration is done within the for-loop. How would StopIteration
get out? My last example raises inside the for-loop, and we see RuntimeError
, the actual exception.
If I understand you correctly, IterableSubprocessError
would never come out, unless all other exceptions are caught. But I do not see that (and it would be a bit weird):
In [1]: from datalad_next.runners import iter_subproc
In [2]: with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
...: for i in sub:
...: print('some')
...:
---------------------------------------------------------------------------
IterableSubprocessError Traceback (most recent call last)
Cell In[2], line 1
----> 1 with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
2 for i in sub:
3 print('some')
File /usr/lib/python3.11/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
142 if typ is None:
143 try:
--> 144 next(self.gen)
145 except StopIteration:
146 return False
File ~/hacking/datalad/next/datalad_next/iterable_subprocess/iterable_subprocess.py:153, in iterable_subprocess(program, input_chunks, chunk_size)
150 raise e.__context__ from None
152 if proc.returncode:
--> 153 raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])
IterableSubprocessError: (2, b"ls: cannot access '/THIS_DOES_NOT_EXIST': No such file or directory\n")
Take a look at the traceback ^^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how do the examples relate to the usage pattern in this PR? The iteration is done within the for-loop. How would
StopIteration
get out? My last example raises inside the for-loop, and we seeRuntimeError
, the actual exception.
[...]
We also use next()
in self._get_props()
. We can use a for-loop for those as well, but then we have to pass an error-result, e.g. None
to the caller and handle that in the the caller.
If I understand you correctly,
IterableSubprocessError
would never come out, unless all other exceptions are caught. [,,,]
Yes! IterableSubprocessError
will never come out, if any exception is raised in the iter_subproc
-context (there is one exception from that rule, the internal _BrokenPipeError
). E.g:
>>> from datalad_next.runners.iter_subproc import iter_subproc
>>> with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub:
... raise(ValueError)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
ValueError
So we have to prevent exception-raising, if we want to get access to the return-code of the subprocess.
[...] But I do not see that (and it would be a bit weird):
In [1]: from datalad_next.runners import iter_subproc In [2]: with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub: ...: for i in sub: ...: print('some') ...: --------------------------------------------------------------------------- IterableSubprocessError Traceback (most recent call last) Cell In[2], line 1 ----> 1 with iter_subproc(['ls', '/THIS_DOES_NOT_EXIST']) as sub: 2 for i in sub: 3 print('some') File /usr/lib/python3.11/contextlib.py:144, in _GeneratorContextManager.__exit__(self, typ, value, traceback) 142 if typ is None: 143 try: --> 144 next(self.gen) 145 except StopIteration: 146 return False File ~/hacking/datalad/next/datalad_next/iterable_subprocess/iterable_subprocess.py:153, in iterable_subprocess(program, input_chunks, chunk_size) 150 raise e.__context__ from None 152 if proc.returncode: --> 153 raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:]) IterableSubprocessError: (2, b"ls: cannot access '/THIS_DOES_NOT_EXIST': No such file or directory\n")Take a look at the traceback ^^^
IIANM, the traceback does not involve an exception raised in the context (the value of typ
in _GeneratorContextManager.__exit__
is None
), therefore lines 152 - 153 of iterable_subprocess.py
are executed.
But this is not the scenario that a try: ... except StopIteration:
is protecting against. The try: ... except StopIteration:
prevents that an exception that is raised in the context, i.e. in the body of with iter_subproc() as x:
, does not leave the context.
Raising any exception in the body of with iter_subproc() as x:
would trigger execution of lines 138 - 139 of iterable_subprocess.py
:
137 except BaseException:
138 proc.terminate()
139 raise
And lines 152 - 153 of iterable_subprocess.py
are never executed, hence IterableSubprocessError
is never raised.
Edit: as a note: I think the behavior of iterable_subprocess
is sensible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I am slowly getting what you are saying. I believe the key is:
We also use next() in self._get_props().
But to me all the wrapping in try/except seems to be solely, because that is the case, and not "also".
If that is true, it would seem to be more sensible to prevent _get_props()
from ever raising StopIteration
. It is an internal function that always needs this protection, and that need is neither obvious, nor documented.
But that seems to require a rather complex logic, because, again if I understand this right, we would want to leave the iter_subproc
context immediately, and have an outer shell catch the then generated IterableSubprocessError
-- which is only generated, if there is no exception already.
If that is all correct, we need to discuss two things:
- this should be documented in
iter_subproc()
-- it is all but obvious - can this be simplified (avoid having to wrap any and all context body like this) by also handling
StopIteration
where_BrokenPipeError
is handled initerable_subprocess()
? Basically:
except _BrokenPipeError as e:
if proc.returncode == 0:
raise e.__context__ from None
# add this ----------------
except StopIteration:
# something try to read output, but there is none
# if that is a problem the next check will tell
# until here ---------------
if proc.returncode:
raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I am slowly getting what you are saying. I believe the key is:
[...]We also use next() in self._get_props().
But to me all the wrapping in try/except seems to be solely, because that is the case, and not "also".
That is exactly right, you said it much better than me (any missunderstandings here are a result of my English communication skills: the "also" referred to fact that we use next()
in addition to for
-loops).
If that is true, it would seem to be more sensible to prevent
_get_props()
from ever raisingStopIteration
. It is an internal function that always needs this protection, and that need is neither obvious, nor documented.
Documentation is a good idea. It should probably be done somewhere in iter_subproc
because any exception that is raised within the body of an iter_subproc
-context would prevent the execution of the code-path that creates IterableSubprocessError
. And we might need the return code in many places.
[ it would seem to be more sensible to prevent
_get_props()
from ever raisingStopIteration
] But that seems to require a rather complex logic, because, again if I understand this right, we would want to leave theiter_subproc
context immediately, and have an outer shell catch the then generatedIterableSubprocessError
-- which is only generated, if there is no exception already.
That is correct. I also think that catching StopIteration
in _get_props()
and signaling an error in our code, e.g. by returning None
would complicate the logic of the callers of _get_props()
, it would make the code harder to read, and it does not seem very "pythonic".
If that is all correct, we need to discuss two things:
- this should be documented in
iter_subproc()
-- it is all but obvious
Very good idea (I like the location ;-) had a similar thought above)
- can this be simplified (avoid having to wrap any and all context body like this) by also handling
StopIteration
where_BrokenPipeError
is handled initerable_subprocess()
? Basically:except _BrokenPipeError as e: if proc.returncode == 0: raise e.__context__ from None # add this ---------------- except StopIteration: # something try to read output, but there is none # if that is a problem the next check will tell # until here --------------- if proc.returncode: raise IterableSubprocessError(proc.returncode, b''.join(stderr_deque)[-chunk_size:])
That is a very good idea. I was not sure, if we want to change iterable_subprocess
so drastically, and how this would affect its correctness. But I think it is probably the simplest solution. (I did some experiments on that earlier because this is one of the options mentioned in the commit message of 210d947. That was also the motivation for looking into the tests).
I think something like this would work:
except (_BrokenPipeError, StopIterations) as e:
# This except-clause prioritizes `IterableSubprocessError` over
# `StopIteration` and `_BrokenPipeError` to avoid swallowing the former.
# This assumes that any `StopIteration` that is
# caught here, and that coincides with a non-zero return code,
# is raised due to an unexpected short read from `stdout` of
# the subprocess and directly connected to the non-zero
# return code and can therefore be ignored, because the
# return code would transport all necessary information.
if proc.returncode == 0:
raise e.__context__ from None
If we want to do this, I would suggest making the change to iterable_subprocesses
in an extra PR, which includes docs and tests. Once this is merged, we simplify UrlSshOperations
by removing the unnecessary try
-except StopIteration
-statements.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a good approach (dedicated PR first, and this one waiting).
I will keep trying to think of a reason why we would not want to change iterable_subprocesses
. Obviously, fewer changes are better, but this issues seems to be impossible to fix from the outside.
Conceptually, _BrokenPipeError
and StopIteration
appear to be to variants of the same problem: premature death. It continues to appear sensible to me to also deal with them in the same way.
You patch is a minimal as it can get. What I cannot anticipate are sideeffect of raising the StopIteration.__context__
. Never seen that being done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] You[r] patch is a minimal as it can get. What I cannot anticipate are sideeffect of raising the StopIteration.context. Never seen that being done.
That is a good point because we should not do that. So the patch will have to be a little more intrusive, something like:
except _BrokenPipeError as e:
if proc.returncode == 0:
raise e.__context__ from None
except StopIteration:
if proc.returncode == 0:
raise
[...]
Conceptually,_BrokenPipeError
andStopIteration
appear to be to variants of the same problem: premature death. It continues to appear sensible to me to also deal with them in the same way.
[...]
That is a nice thought. I was struggeling to find sound reasoning for the patch.
Preparing the PR now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Michael Hanke <[email protected]>
This commit finalizes the changes in the previous commit, which renamed the usage of `UrlOperations._reporting` to `UrlOperations._with_progress`. This commit performs the missing implementation renaming.
c68b98a
to
895a2b5
Compare
This commit modifies `iterable_subprocess` so that it prioritizes `IterableSubprocessError` exceptions over `StopIteration` exceptions, if a `StopIteration` exception is raised in the body of an `iterable_subprocess`-context and if the sub process exits with a non-zero return code. `StopIteration` in conjunction with a non-zero return code indicates an unexpected short read due to an early "error"-exit of the sub process. This commit is a result of this discussion: datalad#546 (comment)
Yhus PR is replaced by #566 |
This PR uses
iter_subproc
instead of the datalad-core basedrun
context-manager to implementSshUrlOperations
.The code changes are based on
iter-annexworktree-subproc
-branch (PR #539) and contain only one commit beyond the branchiter-annexworktree-subproc
: 64e82d7. This branch should be rebased onmain
once PR #539 is merged.TODO: