Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_threading failing randomly on CI #129266

Open
kumaraditya303 opened this issue Jan 24, 2025 · 9 comments
Open

test_threading failing randomly on CI #129266

kumaraditya303 opened this issue Jan 24, 2025 · 9 comments
Labels
tests Tests in the Lib/test dir topic-free-threading

Comments

@kumaraditya303
Copy link
Contributor

See https://github.com/python/cpython/actions/runs/12953185538/job/36134744339?pr=128869

test_reinit_tls_after_fork (test.test_threading.ThreadJoinOnShutdown.test_reinit_tls_after_fork) ... Warning -- Uncaught thread exception: AssertionError
Exception in thread Thread-34 (do_fork_and_wait):
Traceback (most recent call last):
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/threading.py", line 1054, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/threading.py", line 996, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/test_threading.py", line 1438, in do_fork_and_wait
    support.wait_process(pid, exitcode=50)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/support/__init__.py", line 2205, in wait_process
    raise AssertionError(f"process {pid} is still running "
                         f"after {dt:.1f} seconds")
AssertionError: process 10953 is still running after 300.2 seconds
ok
@colesbury
Copy link
Contributor

The other failure in that same job is test_main_thread_after_fork_from_dummy_thread:

test_main_thread_after_fork_from_dummy_thread (test.test_threading.ThreadTests.test_main_thread_after_fork_from_dummy_thread) ... Timeout (0:10:00)!
Thread 0x0000000204a60f40 (most recent call first):
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/selectors.py", line 398 in select
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/subprocess.py", line 2121 in _communicate
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/subprocess.py", line 1218 in communicate
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/support/script_helper.py", line 155 in run_python_until_end
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/support/script_helper.py", line 165 in _assert_python
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/support/script_helper.py", line 182 in assert_python_ok
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/test_threading.py", line 836 in test_main_thread_after_fork_from_foreign_thread
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/test_threading.py", line 849 in test_main_thread_after_fork_from_dummy_thread
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/case.py", line 606 in _callTestMethod
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/case.py", line 660 in run
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/case.py", line 716 in __call__
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/suite.py", line 122 in run
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/suite.py", line 122 in run
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/unittest/runner.py", line 259 in run
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 58 in _run_suite
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 38 in run_unittest
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 136 in test_func
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 92 in regrtest_runner
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 139 in _load_run_test
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 184 in _runtest_env_changed_exc
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 293 in _runtest
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/single.py", line 322 in run_single_test
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/worker.py", line 83 in worker_process
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/worker.py", line 118 in main
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/test/libregrtest/worker.py", line 122 in <module>
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/runpy.py", line 88 in _run_code
  File "/Users/admin/actions-runner/_work/cpython/cpython/Lib/runpy.py", line 198 in _run_module_as_main

@kumaraditya303
Copy link
Contributor Author

Another failure but in multiprocessing https://github.com/python/cpython/actions/runs/12954294733/job/36135983482?pr=129267

test_timeout (test.test_multiprocessing_spawn.test_threads.WithThreadsTestSemaphore.test_timeout) ... skipped 'test not appropriate for threads'
======================================================================
FAIL: test_default_timeout (test.test_multiprocessing_spawn.test_threads.WithThreadsTestBarrier.test_default_timeout)
Test the barrier's default timeout
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\_test_multiprocessing.py", line 2291, in test_default_timeout
    self.assertEqual(len(results), barrier.parties)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 0 != 5
----------------------------------------------------------------------
Ran 105 tests in 68.027s
FAILED (failures=1, skipped=27)

@colesbury
Copy link
Contributor

I've run test_threading locally >500 times and unfortunately can't reproduce this (or any other failures in test_threading). Sometimes these failures only happen when the machine is heavily loaded -- I'll try it again with more runs simultaneously later.

@colesbury
Copy link
Contributor

Encountered the following locally (while running 5 tests simulataneously):

======================================================================
FAIL: test_set_and_clear (test.test_threading.EventTests.test_set_and_clear)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/sgross/Projects/cpython/Lib/test/lock_tests.py", line 519, in test_set_and_clear
    with Bunch(f, N):
         ~~~~~^^^^^^
  File "/Users/sgross/Projects/cpython/Lib/test/lock_tests.py", line 77, in __exit__
    for _ in support.sleeping_retry(support.SHORT_TIMEOUT):
             ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sgross/Projects/cpython/Lib/test/support/__init__.py", line 2545, in sleeping_retry
    for _ in busy_retry(timeout, err_msg, error=error):
             ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sgross/Projects/cpython/Lib/test/support/__init__.py", line 2515, in busy_retry
    raise AssertionError(msg)
AssertionError: timeout (30.4 seconds)

----------------------------------------------------------------------
Ran 206 tests in 43.732s

@colesbury
Copy link
Contributor

I'm increasingly confident that these are bugs in the test code and not in the threading module or other parts of Python.

For example, the test in test_set_and_clear has an inherent race condition: if the main thread sets and clears the event before all of the five background threads start waiting on the event, they will never get woken up and the tests times out. wait_threads_blocked just sleeps for 0.05 seconds. There's no guarantee that all the threads will have started executing and reached the event.wait() by then, especially if the system is heavily loaded as is sometimes the case in the CI.

def test_set_and_clear(self):
# gh-57711: check that wait() returns true even when the event is
# cleared before the waiting thread is woken up.
event = self.eventtype()
results = []
def f():
results.append(event.wait(support.LONG_TIMEOUT))
N = 5
with Bunch(f, N):
# Threads blocked on event.wait()
wait_threads_blocked(N)
# Threads unblocked
event.set()
event.clear()
self.assertEqual(results, [True] * N)

@kumaraditya303
Copy link
Contributor Author

I think we should fix these tests to not rely on arbitrary sleeps for synchronization. In the past we had similar issues with using sleeps in asyncio tests so I ended up removing a lot of those and switch to barrier or other sync primitives, the same should be done for threading tests.

@colesbury
Copy link
Contributor

It's not clear how to avoid the sleeps in this test. You need to wait until the other threads are blocked on event.wait(). How do you do that?

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Jan 25, 2025

I think in this case we can check the waiter count of the underlying condition used _cond. The waiter count isn't exposed as a public attribute unlike barrier's n_waiting so would need to check len(_cond._waiters).

@kumaraditya303
Copy link
Contributor Author

I tested the following by adding a artificial delay before waiting on the events in threads and seems to work.

diff --git a/Lib/test/lock_tests.py b/Lib/test/lock_tests.py
index 8c8f8901f00..9e9bbed4f2f 100644
--- a/Lib/test/lock_tests.py
+++ b/Lib/test/lock_tests.py
@@ -513,12 +513,14 @@ def test_set_and_clear(self):
         event = self.eventtype()
         results = []
         def f():
+            time.sleep(0.1)
             results.append(event.wait(support.LONG_TIMEOUT))
 
         N = 5
         with Bunch(f, N):
             # Threads blocked on event.wait()
-            wait_threads_blocked(N)
+            while len(event._cond._waiters) != N:
+                time.sleep(0.01)
 
             # Threads unblocked
             event.set()

@picnixz picnixz added the tests Tests in the Lib/test dir label Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir topic-free-threading
Projects
None yet
Development

No branches or pull requests

3 participants