Node install stalls because of large retry count #107

ctrox · 2023-02-16T07:55:32Z

On RKE2 we have observed that the machine-provision pod can get stuck for hours because of the very large retry count of 4500. This mainly seems to happen in retrieve_connection_info, which, incidentally, does not exit 1 even after it has exhausted all the retries.

Regardless of what actually causes retrieve_connection_info to keep failing, wouldn't it make sense to use a more reasonable RETRY_COUNT here? Provisioning would then fail faster and be retried by creating a whole new machine.
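
To illustrate the idea, here is a minimal sketch of a bounded retry helper that returns a non-zero status once the retries are used up. It is not taken from install.sh: the `retry` function, the `RETRY_DELAY` variable, and the default values are assumptions made for this example, and only the `RETRY_COUNT` name comes from the script.

```shell
#!/bin/sh
# Hypothetical sketch: retry() and RETRY_DELAY are illustrative names,
# and the defaults below are assumptions, not values from install.sh.
RETRY_COUNT="${RETRY_COUNT:-10}"
RETRY_DELAY="${RETRY_DELAY:-5}"

# Run the given command up to RETRY_COUNT times.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempt=1
    while [ "$attempt" -le "$RETRY_COUNT" ]; do
        if "$@"; then
            return 0
        fi
        # Only wait if another attempt is still to come.
        if [ "$attempt" -lt "$RETRY_COUNT" ]; then
            sleep "$RETRY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $RETRY_COUNT attempts: $*" >&2
    return 1
}
```

A caller could then write something like `retry retrieve_connection_info || exit 1`, so that the pod fails once the attempts are exhausted instead of continuing.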

Jono-SUSE-Rancher added the [zube]: To Triage label Mar 28, 2023

kgtw mentioned this issue Nov 12, 2023

Introduce exponential backoff for install.sh #146

Open

Jono-SUSE-Rancher removed the [zube]: To Triage label May 22, 2024
