Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PYTHON-5044 - Fix successive AsyncMongoClients on a single loop always ti… #2065

Merged
merged 6 commits into from
Jan 22, 2025

Conversation

NoahStapp
Copy link
Contributor

…meout on server selection

@NoahStapp NoahStapp requested a review from ShaneHarvey January 17, 2025 18:51
@NoahStapp
Copy link
Contributor Author

NoahStapp commented Jan 17, 2025

The test failures are due to unittest's handling of async, so these changes will have to be included as part of the migration to pytest.

@@ -1565,6 +1565,8 @@ async def close(self) -> None:
# TODO: PYTHON-1921 Encrypted MongoClients cannot be re-opened.
await self._encrypter.close()
self._closed = True
# Yield to the asyncio event loop so all executor tasks properly exit after cancellation
await asyncio.sleep(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yielding here doesn't guarantee anything about task cleanup. Shouldn't we actually await the tasks to ensure they are cleaned up properly?

@@ -75,6 +75,8 @@ def close(self, dummy: Any = None) -> None:
callback; see monitor.py.
"""
self._stopped = True
if self._task:
self._task.cancel()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering how this relates to the issue described in DRIVERS-3076. Like will calling cancel here change the user visible events a Monitor emits on close()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a task is cancelled, it should stop executing on the next iteration of the event loop. Since I believe the CancelledError is thrown from the next await call inside the cancelled task, it's possible that the Monitor emits events differently between cancellations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, if we use this approach then we won't emit the expected ServerHeartbeatFailedEvent on cancellation. Do we need this change in this PR anymore? Can we defer it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Now that we aren't awaiting background tasks on close in this PR, this change is unneeded.

@ShaneHarvey
Copy link
Member

Any AsyncMongoClient created after the first on a given asyncio event loop will experience a server selection timeout on its first operation. With the default timeout configuration, this will cause a 20-second delay for this first operation, for each client after the first.

How does this cause a 20 second delay? Which task is blocking the loop?

@ShaneHarvey
Copy link
Member

Ah we identified the root cause now. The loop blocks because after the wait() raises CancelledError, the loop.sock_recv_into task remains running, then we call sock.settimeout(sock_timeout) which mutates the socket to be blocking.

So this fix does work because it ensures we cancel the loop.sock_recv_into task before updating the socket timeout.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 21, 2025 16:22
@NoahStapp
Copy link
Contributor Author

The failures are due to our spec runner calling killAllSessions before and after each run on every client it owns, which causes crossing of event loops due to unittest's one loop per test structure. Investigating a fix currently.

@@ -191,6 +191,8 @@ def gc_safe_close(self) -> None:

async def close(self) -> None:
self.gc_safe_close()
if not _IS_SYNC:
await self._executor.join()
Copy link
Member

@ShaneHarvey ShaneHarvey Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can call join here because there are cases where close() gets called by the monitor thread/task itself. Joining on yourself will cause the thread/task to hang.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we remove these changes an open a new issue to track improving the cleanup behavior? Then this PR can be focused on just the network_layer changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not invest time in fixing the cleanup behavior for a test suite we're already working on refactoring. If we're fine with the tests throwing some warnings during the conversion to pytest I'd prefer to just let them throw.

Copy link
Member

@ShaneHarvey ShaneHarvey Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand your comment. Don't we eventually need client.close() to await all the background tasks? That's what I think we need a new ticket to track. This isn't really a test suite issue, it's something end users will encounter when closing clients too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more referring to the work required to get the correct cleanup behavior functioning in our existing test suite. We already have quite a few workarounds for the async test suite to work within the current structure. I expect fixing this issue will only add onto that burden at the same time we're also refactoring the suite entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the core issue of AsyncMongoClient.close() not awaiting all its background tasks needs to be addressed. I'm worried that in doing so with our current test suite, we'll be doing significant additional work that will be thrown out as part of the pytest refactor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Yes I agree with you on the unittest specific changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a separate ticket for the AsyncMongoClient.close() changes, but we'll need to decide what to do if the unittest suite doesn't work with them.

@@ -191,6 +191,8 @@ def gc_safe_close(self) -> None:

async def close(self) -> None:
self.gc_safe_close()
if not _IS_SYNC:
await self._executor.join()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we remove these changes an open a new issue to track improving the cleanup behavior? Then this PR can be focused on just the network_layer changes.

@NoahStapp NoahStapp requested a review from ShaneHarvey January 21, 2025 21:00
Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, once the tests pass.

@NoahStapp NoahStapp merged commit f1af917 into mongodb:master Jan 22, 2025
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants