
[14 Jan 2025] - [sidecar] logs are getting flooded as sidecar code has an endless retry mechanism #5

Open
asundriya opened this issue Jan 15, 2025 · 2 comments

Comments

@asundriya

What happened:
Logs are getting flooded because the driver has an endless retry mechanism; in effect, the failing command spams the log.

What you expected to happen:
Logs should include a retry counter when errors occur during bucket creation/access/grant.
The same error message is repeated day and night if the failure is not fixed. The user should have control over how many times the system retries.

How to reproduce this bug (as minimally and precisely as possible):

  1. Induce an error in any of the bucket workflows (creation/access/grant).
  2. Look at the provisioner log; you will see the same error message repeated day and night if the failure is not fixed.
    If the issue remains for a couple of days, it will consume all of the system's memory and disk space.

The issue:
Log handling for the COSI APIs is done by the sidecar (https://github.com/kubernetes-sigs/container-object-storage-interface-provisioner-sidecar), which goes into an endless retry loop when an error occurs. The sidecar does not currently stop retrying after some time. We should have a configurable retry counter for this.
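For illustration only, here is a minimal sketch of the kind of bounded retry being requested, using `wait.Backoff` and `retry.OnError` from the Kubernetes client libraries. The `createBucket` function and all numbers are hypothetical assumptions, not the sidecar's actual code; `Steps` is the retry counter and `Cap` bounds the delay between attempts:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// createBucket is a hypothetical stand-in for the failing provisioner call.
func createBucket() error {
	return fmt.Errorf("backend unavailable")
}

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay
		Factor:   2.0,                    // double the delay on each failure
		Jitter:   0.1,                    // add some randomness
		Steps:    5,                      // give up after 5 attempts
		Cap:      30 * time.Second,       // never wait longer than this between attempts
	}

	// retry.OnError retries createBucket with the backoff above, treating
	// every error as retriable in this example.
	if err := retry.OnError(backoff, func(error) bool { return true }, createBucket); err != nil {
		fmt.Println("giving up after bounded retries:", err)
	}
}
```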

@BlaineEXE BlaineEXE changed the title [DATE] - [sidecar] logs are getting flooded as sidecar code has an endless retry mechanism [14 Jan 2025] - [sidecar] logs are getting flooded as sidecar code has an endless retry mechanism Jan 24, 2025
@BlaineEXE
Contributor

The sidecar is expected to be an operator of sorts, so I do expect it to retry until it's successful. That's part of the control theory that keeps Kubernetes and its systems stable.

That said, normally I also expect some sort of backoff mechanism. I would propose that the fix for this issue report focus on ensuring a reasonable backoff. We can probably start with the backoff strategy and timing recommended by controller-runtime (reference needed) and readjust as needed in the future if this continues to come up.
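As a sketch only (assumed names and values, not the sidecar's current wiring): controller-runtime work queues apply an exponential per-item failure backoff, and the same `workqueue` rate limiter can be exercised directly to see how the delays grow. The defaults here (5ms base, 1000s cap) mirror commonly used controller work queue defaults, but should be confirmed against the controller-runtime version in use:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential per-item backoff: 5ms base delay, doubling per failure,
	// capped at 1000s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	key := "bucketclaim/example" // hypothetical reconcile key

	for i := 0; i < 8; i++ {
		// When records a failure for the key and returns the delay before
		// the next retry; NumRequeues reports how many failures so far.
		fmt.Printf("failure %d -> retry after %v\n", limiter.NumRequeues(key)+1, limiter.When(key))
	}

	// On a successful reconcile, Forget resets the failure count so the next
	// error starts again from the base delay.
	limiter.Forget(key)
}
```

The point is that repeated failures for one object wait progressively longer, which keeps the log from flooding while still retrying; a hard retry cap, if desired, could be layered on top of this.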

@BlaineEXE
Contributor

One of my first thoughts when triaging/planning bugs is what sort of testing is needed. Regression tests are very important.

For bugs related to reconcile retry timing and logging, I have found that timing-related log-output expectations are hard to codify. In Rook, we tend to not create regression tests for these cases and instead just try to do our best to make sure system internals are logging helpful info without frequent spam.

@shanduur, my inclination here is to not require deeply involved unit/e2e tests, but I'm curious about your input here as well.
