Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSHRemoteIO on windows hangs #68

Closed
mih opened this issue Sep 13, 2023 · 13 comments · Fixed by #69
Closed

SSHRemoteIO on windows hangs #68

mih opened this issue Sep 13, 2023 · 13 comments · Fixed by #69

Comments

@mih
Copy link
Member

mih commented Sep 13, 2023

This is an interim conclusion from #58

A test tries to run init_remote() for an ora remote, not even expecting it to work, but error out due to insufficiently meet preconditions. But it just hangs.

This may be due to the fact that the target connection is via SSH and the target port is ignored by the implementation of SSHRemoteIO from core. but the same code passes on mac and linux.

This could be investigated by deploying a matching ssh config that declares the port.

@mih
Copy link
Member Author

mih commented Sep 13, 2023

#69 narrows this down to SSHRemoteIO

@mih mih changed the title init_remote() on windows hangs SSHRemoteIO on windows hangs Sep 13, 2023
@mih
Copy link
Member Author

mih commented Sep 14, 2023

Debugging log on windows:

Plain ssh works fine

c:\projects\datalad-ria>ssh -i C:\DLTMP\datalad_tests_id_rsa ssh://sshuser@datalad-test-ria:2222
Linux 1ba80b1234c2 4.19.27-linuxkit #1 SMP Sun Mar 10 18:51:44 UTC 2019 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Sep 14 06:45:43 2023 from 10.0.0.40
sshuser@1ba80b1234c2:~$ exit
logout
Connection to datalad-test-ria closed.

With SSHRemoteIO not

>>> import datalad
>>> datalad.cfg.set('datalad.log.level', 'debug', scope='global')
>>> datalad.cfg.set('datalad.ssh.identityfile', 'C:\DLTMP\datalad_tests_id_rsa', scope='global')
>>> import datalad.distributed.ora_remote as ora
>>> ssh=ora.SSHRemoteIO('ssh://sshuser@datalad-test-ria:2222')

<hangs>

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python39-x64\lib\site-packages\datalad\distributed\ora_remote.py", line 351, in __init__
    line = self.shell.stdout.readline()

With SSHManager

>>> import datalad
>>> datalad.cfg.get('datalad.ssh.multiplex-connections')
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sshman
<class 'datalad.support.sshconnector.NoMultiplexSSHManager'>
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> sshman._prep_connection_args(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _prep_connection_args() missing 1 required positional argument: 'url'
>>> sshman._prep_connection_args(url)sm
KeyboardInterrupt
>>> sm = sshman()
>>> sm._prep_connection_args(url)
(URL(hostname='datalad-test-ria', netloc='sshuser@datalad-test-ria:2222', port=2222, scheme='ssh', username='sshuser'), 'C:\\DLTMP\\datalad_tests_id_rsa')
>>> sm.get_connection(url)
NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>)
>>> con = sm.get_connection(url)
>>> con.ssh_executable
'C:\\Windows\\System32\\OpenSSH\\ssh.exe'
>>> con._ssh_open_args
['-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa']
>>> con._ssh_args
[]
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>

and it can be repeatedly hung:

>>> import datalad
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sm = sshman()
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> con = sm.get_connection(url)
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>

What seems important is that the function returns (the return value is there), and there is a new prompt, but afterwards it "hangs", as if the interpreter is not accepting new inputs. once I kill the Python process, I can reuse the terminal session with a new Python session without issues, and I can make it hang at that exact same spot.

I can also make it hang, if I run that exact same command directly via the runner:

>>> con.runner.run(['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'])
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=NoCapture) (cwd=None)
Linux
>>>

<hangs>

and also without any DataLad at all

>>> import subprocess
>>> subprocess.run(['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'])

Linux
CompletedProcess(args=['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', ....
>>>  

<hangs>

But it needs the python session, outside (but in the same terminal session), it runs fine, any number of times:

c:\projects\datalad-ria>C:\Windows\System32\OpenSSH\ssh.exe -p 2222 -i C:\DLTMP\datalad_tests_id_rsa sshuser@datalad-test-ria uname
Linux

c:\projects\datalad-ria>
c:\projects\datalad-ria>C:\Windows\System32\OpenSSH\ssh.exe -p 2222 -i C:\DLTMP\datalad_tests_id_rsa sshuser@datalad-test-ria uname
Linux

c:\projects\datalad-ria>

@mih
Copy link
Member Author

mih commented Sep 14, 2023

subprocess docs contain the following bits that may need to be checked


If close_fds is true, all file descriptors except 0, 1 and 2 will be closed before the child process is executed. Otherwise when close_fds is false, file descriptors obey their inheritable flag as described in Inheritance of File Descriptors.

On Windows, if close_fds is true then no handles will be inherited by the child process unless explicitly passed in the handle_list element of STARTUPINFO.lpAttributeList, or by standard handle redirection.

Changed in version 3.2: The default for close_fds was changed from False to what is described above.

Changed in version 3.7: On Windows the default for close_fds was changed from False to True when redirecting the standard handles. It’s now possible to set close_fds to True when redirecting the standard handles.

@adswa
Copy link
Member

adswa commented Sep 14, 2023

I tried a few more things:

versions

(py39) C:\Users\adina>where ssh
C:\Windows\System32\OpenSSH\ssh.exe

(py39) C:\Users\adina>ssh -V
OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2
(py39) C:\Users\adina>python
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> process = subprocess.Popen(["ssh", "-i",  r"C:\Users\adina\.ssh\id_ed25519_juseless", "juseless.inm7.de", "uname"], shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> result = process.stdout.readlines()
[hangs]
>>> subprocess.run(r"ssh -i C:\Users\adina\.ssh\id_ed25519_juseless juseless.inm7.de uname", close_fds=False)
Enter passphrase for key 'C:\Users\adina\.ssh\id_ed25519_juseless':
Linux
CompletedProcess(args='ssh -i C:\\Users\\adina\\.ssh\\id_ed25519_juseless juseless.inm7.de uname', returncode=0)
>>>
[hangs]
subprocess.run(r"ssh -i C:\Users\adina\.ssh\id_ed25519_juseless juseless.inm7.de uname", close_fds=True)
Linux
CompletedProcess(args='ssh -i C:\\Users\\adina\\.ssh\\id_ed25519_juseless juseless.inm7.de uname', returncode=0)
>>>
[hangs]

@adswa
Copy link
Member

adswa commented Sep 14, 2023

Maybe some of these Windows specific subprocess configurations could be relevant: https://docs.python.org/3/library/subprocess.html#windows-popen-helpers

@adswa
Copy link
Member

adswa commented Sep 14, 2023

I found a Gist that made things work: https://gist.github.com/josephcoombe/3a234721fc5a6885ca4f91e3a27860f4.

(it needs an additional encoding="utf-8" in the Popen() call, but otherwise works). I will investigate closer

@adswa
Copy link
Member

adswa commented Sep 14, 2023

I don't understand subprocess well, so I'm just sharing a few observations.
This is how I can get subprocess.run() to run without hanging.

import os, platform, subprocess
USER = "adina"
COMMAND = "uname -a"
HOST = "juseless.inm7.de"
system32 = os.path.join(os.environ['SystemRoot'], 'SysNative' if platform.architecture()[0] == '32bit' else 'System32')
ssh_path = os.path.join(system32, 'OpenSSH/ssh.exe')
PRIVATE_KEY_LOCATION = "C:/Users/adina/.ssh/id_ed25519_juseless"
subprocess.run([ssh_path, '-i', PRIVATE_KEY_LOCATION, "{}@{}".format(USER, HOST), COMMAND], stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
CompletedProcess(args=['C:\\Windows\\System32\\OpenSSH/ssh.exe', '-i', 'C:/Users/adina/.ssh/id_ed25519_juseless', '[email protected]', 'uname -a'], returncode=0, stdout=b'Linux juseless 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.37-1 (2023-07-03) x86_64 GNU/Linux\n', stderr=b'')

print("No hanging!")

No hanging!

Edit:
The hanging reappears if I don't set stdout=subprocess.PIPE, and also if I don't set stdin=subprocess.PIPE

Edit:
The complex ssh_path construction also does not seem to be necessary, it works also with a plain "ssh".

Edit:

I have tested that the call

subprocess.run(["ssh", '-i', PRIVATE_KEY_LOCATION, "{}@{}".format(USER, HOST), COMMAND], stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)

works (in CMD) if:

  • I don't have Git installed
  • I have Git installed and configured to use external SSH (i.e., the one shipped with Windows)
  • I have Git installed and configured to use bundled SSH (i.e., the one shipped within Git)

@christian-monch
Copy link

christian-monch commented Sep 14, 2023

I am still not sure what the root cause of the problem is.

On Windows 10, in cmd.exe and powershell.exe in an interactive python sessions, the following will end up with a hanging python interpreter:

>>> import subprocess
>>> subprocess.run(['ssh', 'unix-machine', 'uname'])
>>> print('after run')

(Weirdly enough, no other python interpreter started in the same powershell session (after killing the hanging python process) is accepting interactive input).

  • If I put the commands into a file, e.g. hangtest.py and execute them via python hangtest.py, the interpreter prints after run and exits.
  • If I execute the command with python -i hangtest.py, the interpreter prints after run, and hangs.
  • If I add stdin=subprocess.DEVNULL or stdin=subprocess.PIPE, to the run-parameters, the interpreter prints after run and does not hang.

Interpretation

Without overriding stdin, the subprocess, i.e. ssh, and python share the same file pointer. It seems that stdin is configured in a way that unexpected by the interpreter and messes with python's way to read from sys.stdin. For example, running the following program would hang after printing the prompt enter something> :

import subprocess
subprocess.run(['ssh', 'unix-machine', 'uname'])
print('after run')
print(input('enter something> '))

But this program will print out individual characters read from sys.stdin:

import sys
import subprocess
subprocess.run(['ssh', 'unix-machine', 'uname'])
print('after run')
while True:
    print(read(sys.stdin))

Not sure about the conclusion yet.

[Edit]
It might well be that ssh configures the stdin descriptor in a way that is unexpected by python.

[Edit 2]
Until the real cause of the described problems is found, it seems that, on Windows, we should never let ssh touch a stdin-file descriptor that is also used by the python interpreter itself. I.e. always specify stdin=PIPE or stdin=DEVNUL or similar and feed input into the resulting ssh-input descriptor ourselves.

@mih
Copy link
Member Author

mih commented Sep 15, 2023

This all sounds like we understand how to avoid the problem. What I do not understand is what a fix could look like.

We need a fix that is SSH-specific. But the changes in handling described here seem only possible deep in the Runner code. Runner.run() docs say

stdin : file-like, bytes, Queue, or None
          If stdin is a file-like, it will be directly used as stdin for the
          subprocess. The caller is responsible for writing to it and closing it.
          If stdin is a bytes, it will be fed to stdin of the subprocess.
          If all data is written, stdin will be closed.
          If stdin is a Queue, all elements (bytes) put into the Queue will
          be passed to stdin until None is read from the queue. If None is read,
          stdin of the subprocess is closed.

Looking further, ThreadedRunner docs clarify

If stdin is None, nothing will be sent to stdin of the subprocess. More precisely, `subprocess.Popen` will be called with `stdin=None`.

That seems to imply this can only be fixed by supporting an additional argument type/value for stdin in the Runner code hierarchy, which is then used by a select code path for ssh client execution on windows (only).

Ping @christian-monch

@christian-monch
Copy link

christian-monch commented Sep 15, 2023

This all sounds like we understand how to avoid the problem. What I do not understand is what a fix could look like.

We need a fix that is SSH-specific. But the changes in handling described here seem only possible deep in the Runner code.
[...]
That seems to imply this can only be fixed by supporting an additional argument type/value for stdin in the Runner code hierarchy, which is then used by a select code path for ssh client execution on windows (only).
[...]

I think it can and should be fixed by calling the runner differently when executing ssh in Windows. The current runner code supports the "solution" outlined above (btw: the implementation of SSHRemoteIO in datalad-core does not use only the runner for ssh-subprocesses, but also direct ssh invocation). I will dig into a solution on my Windows machine.

[...] supporting an additional argument type/value for stdin in the Runner code hierarchy [...]

Although this might not be required for the solution of this problem, it might be a good idea to support DEVNULL. A workaround would be to pass an empty byte-array, i.e. stdin=b''.

I am on it.

@christian-monch
Copy link

christian-monch commented Sep 15, 2023

Isolating the interpreters stdin-descriptor from ssh does indeed solve the problem in Windows. For example, the hanging code from above:

>>> import datalad
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sm = sshman()
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> con = sm.get_connection(url)
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>

Will not hang, with the line:

con('uname')

is replaced with

con('uname', stdin=b'')

I am looking into the code-paths that lead to ssh touching the interpreter's stdin-stream to see whether we can prevent that in the new ria-remote implementation, without any changes in datalad-core

@mih
Copy link
Member Author

mih commented Sep 15, 2023

Before this is closed, we must file a companion issue in datalad-core.

mih added a commit that referenced this issue Sep 15, 2023
mih added a commit that referenced this issue Sep 18, 2023
This patch aims to fix a hanging Python sessions after the execution
of an SSH remote command call with no particular stdin input.

Interpretation from #68

    Without overriding stdin, the subprocess, i.e. ssh, and python share
    the same file pointer. It seems that stdin is configured in a way that
    unexpected by the interpreter and messes with python's way to read from
    sys.stdin.

This patch passes an explicit `b''` as `stdin` to the SSH client
execution process to effectively achieve a separate fiel descriptor for
that client process.

This patch should not interfere with the implementation of the `sshrun`
command in datalad-core. It uses a dedicated not-None value for any
execution. However, the compatibility and interference of this patch
should be subject to a thorough investigation and widespread testing
before this changeset is proposed for datalad-core.

Closes #68
mih added a commit that referenced this issue Sep 19, 2023
This patch aims to fix a hanging Python sessions after the execution
of an SSH remote command call with no particular stdin input.

Interpretation from #68

    Without overriding stdin, the subprocess, i.e. ssh, and python share
    the same file pointer. It seems that stdin is configured in a way that
    unexpected by the interpreter and messes with python's way to read from
    sys.stdin.

This patch passes an explicit `b''` as `stdin` to the SSH client
execution process to effectively achieve a separate fiel descriptor for
that client process.

This patch should not interfere with the implementation of the `sshrun`
command in datalad-core. It uses a dedicated not-None value for any
execution. However, the compatibility and interference of this patch
should be subject to a thorough investigation and widespread testing
before this changeset is proposed for datalad-core.

Closes #68
@mih mih closed this as completed in 53a3bfb Sep 19, 2023
@mih
Copy link
Member Author

mih commented Sep 20, 2023

Reopening: Some breakage is fixed, but SSHRemoteIO is still not functional. This is only the case with #69

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants