Interchange hang or SIGABRT on kill_event driven exit #3697

Open
benclifford opened this issue Nov 14, 2024 · 3 comments

@benclifford
Collaborator

Describe the bug
There are a few paths through which the interchange exits. The regular shutdown path, driven by the DFK, is to send a SIGTERM which immediately kills the process.

Another, rarer path uses kill_event, which is polled every 10ms and is set when a particular form of incorrect worker registration is received.
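As a rough illustration of that polling pattern (a stdlib-only sketch, not the interchange's actual code): a `queue.Queue` stands in for a ZMQ socket, and `blocking_puller` / `polling_puller` are hypothetical names. A thread sitting in a blocking receive never observes `kill_event`, while a thread that receives with a 10ms timeout can return on its own once the event is set.

```python
import queue
import threading

kill_event = threading.Event()  # stand-in for the interchange's kill_event
incoming = queue.Queue()        # stand-in for a ZMQ socket (assumption)


def blocking_puller():
    # Blocks until a message arrives, so it never re-checks kill_event.
    incoming.get()


def polling_puller():
    # Wakes every 10ms to re-check kill_event, so it can return by itself.
    while not kill_event.is_set():
        try:
            incoming.get(timeout=0.01)
        except queue.Empty:
            continue


blocked = threading.Thread(target=blocking_puller, daemon=True)
polite = threading.Thread(target=polling_puller, daemon=True)
blocked.start()
polite.start()

# Something else sets the event a little later; the main loop polls it every 10ms.
threading.Timer(0.05, kill_event.set).start()
while not kill_event.wait(timeout=0.01):
    pass

polite.join(timeout=1)
blocked.join(timeout=0.1)
polite_alive = polite.is_alive()
blocked_alive = blocked.is_alive()
print("polling thread still alive:", polite_alive)
print("blocking thread still alive:", blocked_alive)

# Release the blocked thread so this sketch itself exits tidily.
incoming.put(None)
blocked.join(timeout=1)
```

In this sketch only the blocking thread is still running when the main loop decides to exit, which is the kind of leftover daemon thread the fatal error message below mentions.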

When that kill_event path is taken, the interchange exits with a SIGABRT, placing this (or a variant) on stderr:

Exception in thread Interchange-Task-Puller:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Interchange-Command:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self._target(*self._args, **self._kwargs)
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 26, in wrapped
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 213, in task_puller
    self._target(*self._args, **self._kwargs)
    msg = self.task_incoming.recv_pyobj()
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 26, in wrapped
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    r = func(*args, **kwargs)
  File "/home/benc/parsl/virtualenv-3.12/lib/python3.12/site-packages/zmq/sugar/socket.py", line 975, in recv_pyobj
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 251, in _command_server
    command_req = self.command_channel.recv_pyobj()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    msg = self.recv(flags)
  File "/home/benc/parsl/virtualenv-3.12/lib/python3.12/site-packages/zmq/sugar/socket.py", line 975, in recv_pyobj
          ^^^^^^^^^^^^^^^^
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
    msg = self.recv(flags)
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x0000560f81d13238)

Current thread 0x00007fcb0a6eb740 (most recent call first):
  <no Python frame>

Extension modules: zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, setproctitle._setproctitle, sqlalchemy.cimmutabledict, greenlet._greenlet, sqlalchemy.cprocessors, sqlalchemy.cresultproxy, psutil._psutil_linux, psutil._psutil_posix, charset_normalizer.md, _cffi_backend, yaml._yaml, ndcctools._cwork_queue, ndcctools._cresource_monitor (total: 21)

The interchange then exits (as desired) but with unix exit code -6, SIGABRT.
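For readers unfamiliar with the negative exit code: on POSIX, Python's `subprocess` reports a child killed by a signal as the negated signal number, and SIGABRT is signal 6 on Linux. A minimal sketch, using a throwaway child process that aborts itself as a stand-in for the interchange:

```python
import signal
import subprocess
import sys

# Child that aborts itself; a stand-in for a process dying with SIGABRT.
child = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    stderr=subprocess.DEVNULL,
)

# Death by signal shows up as the negated signal number.
print(child.returncode)
print(child.returncode == -signal.SIGABRT)
```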

This is probably mostly cosmetic: the interchange still exits.

To Reproduce
I will make a pull request with a demonstrator test.

Expected behavior
clean exit

Environment
my laptop, branched from Parsl 2024.11.11

@benclifford
Collaborator Author

To recreate, run the test in #3698 with stderr/streams enabled:

$ !p
pytest  parsl/tests/test_htex/test_interchange_exit_bad_registration.py --config local -s
========================================== test session starts ===========================================
platform linux -- Python 3.12.6+, pytest-7.4.4, pluggy-1.4.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
rootdir: /home/benc/parsl/src/parsl/parsl/tests
configfile: pytest.ini
plugins: random-order-1.1.1, typeguard-2.13.3, cov-4.1.0, hypothesis-6.103.1
collected 1 item                                                                                         

parsl/tests/test_htex/test_interchange_exit_bad_registration.py /home/benc/parsl/virtualenv-3.12/bin/interchange.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('parsl==1.3.0.dev0')
Exception in thread Interchange-Task-Puller:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Interchange-Command:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self._target(*self._args, **self._kwargs)
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 26, in wrapped
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 213, in task_puller
    msg = self.task_incoming.recv_pyobj()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/virtualenv-3.12/lib/python3.12/site-packages/zmq/sugar/socket.py", line 975, in recv_pyobj
    self._target(*self._args, **self._kwargs)
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 26, in wrapped
BENC: entering zmq ctx destroy
BENC: leaving zmq ctx destroy
BENC: entering zmq ctx destroy
BENC: leaving zmq ctx destroy
BENC: entering zmq ctx destroy
BENC: leaving zmq ctx destroy
.

============================================ warnings summary ============================================
../../virtualenv-3.12/lib/python3.12/site-packages/dateutil/tz/tz.py:37
  /home/benc/parsl/virtualenv-3.12/lib/python3.12/site-packages/dateutil/tz/tz.py:37: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    EPOCH = datetime.datetime.utcfromtimestamp(0)

parsl/executors/workqueue/executor.py:43
  /home/benc/parsl/src/parsl/parsl/executors/workqueue/executor.py:43: DeprecationWarning: 'import work_queue' is deprecated. Please instead use: 'import ndcctools.work_queue'
    import work_queue as wq

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================== 1 passed, 2 warnings in 11.54s =====================================

@benclifford
Collaborator Author

In some situations in my replicator test, the interchange will exit with this jumbled pair of stack traces, but unix exit code 0, not -6:

Exception in thread Interchange-Task-Puller:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
Exception in thread Interchange-Command:
Traceback (most recent call last):
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 26, in wrapped
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.py", line 213, in task_puller
    msg = self.task_incoming.recv_pyobj()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/virtualenv-3.12/lib/python3.12/site-packages/zmq/sugar/socket.py", line 975, in recv_pyobj
    self.run()
  File "/usr/local/lib/python3.12/threading.py", line 1012, in run
    msg = self.recv(flags)
          ^^^^^^^^^^^^^^^^
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv

github-merge-queue bot pushed a commit that referenced this issue Jan 16, 2025
On certain bad registration messages, the interchange should exit
immediately. This tests that.

See #3697 for some bad (cosmetic?) behaviour here - the interchange
SIGABRTs on this code path rather than exiting cleanly, and this test
includes a commented out assert that could check for clean exit (in
addition to checking that the interchange process exits at all)

## Type of change

- Code maintenance/cleanup
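
The kind of check described in that commit message (does the process exit at all, and did it exit cleanly?) could be sketched roughly as follows. This is not the test from #3698: `classify_exit` is a hypothetical helper, and the child commands are throwaway Python one-liners standing in for the interchange.

```python
import signal
import subprocess
import sys


def classify_exit(cmd, timeout):
    """Run cmd and report how it ended: 'clean', 'hang', a signal name, or an exit code."""
    proc = subprocess.Popen(cmd, stderr=subprocess.DEVNULL)
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Did not exit in time: treat as a hang and clean up the child.
        proc.kill()
        proc.wait()
        return "hang"
    if rc == 0:
        return "clean"
    if rc < 0:
        # Negative return code means the child was killed by that signal.
        return signal.Signals(-rc).name
    return f"exit code {rc}"


py = sys.executable
clean_result = classify_exit([py, "-c", "pass"], timeout=10)
abort_result = classify_exit([py, "-c", "import os; os.abort()"], timeout=10)
hang_result = classify_exit([py, "-c", "import time; time.sleep(60)"], timeout=1)
print(clean_result, abort_result, hang_result)
```

A test that only checks "the process exited" would accept the first two outcomes; asserting a return code of 0 would additionally reject the SIGABRT case.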
@benclifford
Collaborator Author

The test introduced in #3698 appears to fail, with a hung interchange process and different ZMQ errors on stderr. So although I labelled this issue as only cosmetic, it appears not to be.

@benclifford benclifford changed the title Interchange SIGABRT on kill_event driven exit (only cosmetic?) Interchange hang or SIGABRT on kill_event driven exit Jan 18, 2025