
[14 Jan 2025] - [sidecar] logs are getting flooded as sidecar code has an endless retry mechanism #5

Open
asundriya opened this issue Jan 15, 2025 · 2 comments

Comments

@asundriya

What happened:
Logs are getting flooded because the driver has an endless retry mechanism; in effect, the failing command spams the log.

What you expected to happen:
Logs should include a retry counter when errors occur during bucket creation/access/grant.
The same error message is repeated day and night if the failure is not fixed. The user should have control over how many times the system retries.

How to reproduce this bug (as minimally and precisely as possible):

  1. Induce an error in any of the bucket workflows (creation/access/grant).
  2. Look at the provisioner log; you will see the same error message repeated day and night if the failure is not fixed.
    If the issue remains for a couple of days, it will consume all of the system's memory and disk space.

The issue:
Log handling for the COSI APIs is done by the sidecar (https://github.com/kubernetes-sigs/container-object-storage-interface-provisioner-sidecar), which goes into an endless retry loop when an error occurs. The sidecar does not currently stop retrying after some time. We should have a configurable retry counter for this.
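For illustration only, here is a minimal sketch of the kind of bounded retry being requested, using `wait.Backoff` and `retry.OnError` from the Kubernetes client libraries. The `createBucket` function and all numbers are hypothetical assumptions, not the sidecar's actual code; `Steps` is the retry counter and `Cap` bounds the delay between attempts:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// createBucket is a hypothetical stand-in for the failing provisioner call.
func createBucket() error {
	return fmt.Errorf("backend unavailable")
}

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay
		Factor:   2.0,                    // double the delay on each failure
		Jitter:   0.1,                    // add some randomness
		Steps:    5,                      // give up after 5 attempts
		Cap:      30 * time.Second,       // never wait longer than this between attempts
	}

	// retry.OnError retries createBucket with the backoff above, treating
	// every error as retriable in this example.
	if err := retry.OnError(backoff, func(error) bool { return true }, createBucket); err != nil {
		fmt.Println("giving up after bounded retries:", err)
	}
}
```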

@BlaineEXE BlaineEXE changed the title [DATE] - [sidecar] logs are getting flooded as sidecar code has an endless retry mechanism [14 Jan 2025] - [sidecar] logs are getting flooded as sidecar code has an endless retry mechanism Jan 24, 2025
@BlaineEXE
Contributor

The sidecar is expected to be an operator of sorts, so I do expect it to retry until it's successful. That's part of the control theory that keeps Kubernetes and its systems stable.

That said, normally I also expect some sort of backoff mechanism. I would propose that the fix for this issue report focus on ensuring a reasonable backoff. We can probably start with the backoff strategy and timing recommended by controller-runtime (reference needed) and readjust as needed in the future if this continues to come up.
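As a sketch only (assumed names and values, not the sidecar's current wiring): controller-runtime work queues apply an exponential per-item failure backoff, and the same `workqueue` rate limiter can be exercised directly to see how the delays grow. The defaults here (5ms base, 1000s cap) mirror commonly used controller work queue defaults, but should be confirmed against the controller-runtime version in use:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential per-item backoff: 5ms base delay, doubling per failure,
	// capped at 1000s.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	key := "bucketclaim/example" // hypothetical reconcile key

	for i := 0; i < 8; i++ {
		// When records a failure for the key and returns the delay before
		// the next retry; NumRequeues reports how many failures so far.
		fmt.Printf("failure %d -> retry after %v\n", limiter.NumRequeues(key)+1, limiter.When(key))
	}

	// On a successful reconcile, Forget resets the failure count so the next
	// error starts again from the base delay.
	limiter.Forget(key)
}
```

The point is that repeated failures for one object wait progressively longer, which keeps the log from flooding while still retrying; a hard retry cap, if desired, could be layered on top of this.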

@BlaineEXE
Contributor

One of my first thoughts when triaging/planning bugs is what sort of testing is needed. Regression tests are very important.

For bugs related to reconcile retry timing and logging, I have found that timing-related log-output expectations are hard to codify. In Rook, we tend to not create regression tests for these cases and instead just try to do our best to make sure system internals are logging helpful info without frequent spam.

@shanduur, my inclination here is to not require deeply involved unit/e2e tests, but I'm curious about your input here as well.
