Services started before node identity is established on Hetzner Cloud #10105
Comments
CACPPT is not receiving a lot of support; Omni is a better solution in general (and doesn't care about hostnames). But still, please submit full boot logs of a node in a failed state so we can understand the issue better. |
I don't think the issue is directly related to the use of CACPPT, but here are the logs: https://gist.github.com/echozio/4e15617f276da492c40d61ae832fa907 As you can see, it fetches the hcloud network config and sets the correct hostname after starting a number of services (etcd among others):
|
I'm not familiar enough with Hetzner Cloud, but it looks like the network got configured via DHCP, while the hostname was not sent via DHCP. Hetzner Cloud relies on the network being configured to download metadata, so that download happens once the network is configured. From Talos's point of view, once the network is configured, it is ready to start running. So in this case I would say you need to set the hostname via machine configuration as a static one (at least for controlplane nodes), as etcd doesn't support changing member hostnames. Controlplane nodes should have a stable hostname. The only other way to fix this is to completely disable the Talos default hostname (so that it would wait for Hetzner to supply one), but we don't have this feature yet. |
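For reference, a minimal sketch of the static-hostname approach described above. It assumes the `machine.network.hostname` field of the Talos machine configuration (as used in the v1.8/v1.9 releases mentioned in this issue); the helper name `static_hostname_patch` and the example hostname `cp-1` are made up for illustration. The output is JSON, which is also valid YAML, so it should be usable as a config patch, but how you feed it to your provisioning tooling depends on your setup.

```python
import json


def static_hostname_patch(hostname: str) -> str:
    """Build a machine config patch that pins the node hostname.

    Assumption: the target Talos version reads the hostname from
    machine.network.hostname. The function name is hypothetical.
    """
    if not hostname:
        raise ValueError("hostname must be non-empty")
    patch = {"machine": {"network": {"hostname": hostname}}}
    # JSON is a subset of YAML, so this can be saved as a patch file.
    return json.dumps(patch, indent=2)


if __name__ == "__main__":
    # "cp-1" is a placeholder; use the name the node should join etcd with.
    print(static_hostname_patch("cp-1"))
```

With a static hostname in the machine configuration, the node does not depend on the platform metadata arriving before etcd starts.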
Should not fetching this metadata from Hetzner be considered part of getting the network ready when |
It's hard to say whether we should block on this or not, as the HCloud provider might be down, but if the network is up, why should we block? As long as we have enough information to proceed, we should proceed. Completely disabling the default hostname is the proper fix here (and up to you to enable this). |
If disabling the generated default hostname will block service execution until a hostname is set that would likely be a perfectly acceptable fix for my use case. I'm assuming it's |
There's no such feature in Talos (yet). |
I see. What did you mean by this?
Setting the hostname in the configuration would also solve my issue, but I'd need to get CACPPT to play ball, as it's currently creating machines where the Machine's name != the HCloudMachine's name and it only supports setting the hostname from the Machine's name. This is something I'll investigate further if there isn't currently any feature in Talos I could leverage (e.g. disabling the default hostname). |
I mean that, in general, waiting for hcloud metadata doesn't make sense to me, whereas if you explicitly disable the default hostname, Talos would wait for HCloud to provide one (or for any other hostname from any other source). |
I see, I agree that would be a good solution. I guess the "and up to you to enable this" led me to believe such a feature existed and I could just enable it. Thanks a lot for your help. I'll update this issue if I'm able to solve this by other means. If disabling the default hostnames is planned or in the works I'd be happy to help out. |
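To make the proposed behavior concrete, here is a toy model of it. This is not Talos code: the class `HostnameGate`, the source names, and the priority order are all assumptions made up for illustration. It only shows the idea that a generated default hostname makes the node "ready" immediately, while disabling the default forces a wait until some other source supplies a hostname.

```python
from typing import Dict, Optional

# Lower number wins. Source names are illustrative, not Talos identifiers.
PRIORITY = {"machine-config": 0, "platform": 1, "default": 2}


class HostnameGate:
    """Toy model: decides whether services may start, based on hostname sources."""

    def __init__(self, default_enabled: bool = True) -> None:
        self.default_enabled = default_enabled
        self.candidates: Dict[str, str] = {}

    def offer(self, source: str, hostname: str) -> None:
        """Record a hostname supplied by one source."""
        if source not in PRIORITY:
            raise ValueError(f"unknown source: {source}")
        self.candidates[source] = hostname

    def current(self) -> Optional[str]:
        """Return the best hostname known so far, or None if there is none."""
        usable = {
            s: h
            for s, h in self.candidates.items()
            if s != "default" or self.default_enabled
        }
        if not usable:
            return None
        best = min(usable, key=PRIORITY.__getitem__)
        return usable[best]

    def ready(self) -> bool:
        """Services may start only once some usable hostname exists."""
        return self.current() is not None


if __name__ == "__main__":
    racy = HostnameGate(default_enabled=True)
    racy.offer("default", "talos-generated")
    print(racy.ready(), racy.current())  # ready with the generated name

    strict = HostnameGate(default_enabled=False)
    strict.offer("default", "talos-generated")
    print(strict.ready(), strict.current())  # not ready yet
    strict.offer("platform", "cp-1")
    print(strict.ready(), strict.current())  # ready with the platform name
```

In the first case a service such as etcd could start under the generated name before the platform name arrives, which matches the symptom reported in this issue. In the second case nothing starts until a non-default source has spoken.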
I made a small change to CACPPT and set |
Bug Report
Description
It appears that when the hostname is not set through the TalosConfig, Talos will start services before the final hostname is set, causing control plane nodes to join etcd with the wrong hostname and never reach a healthy state.
This seems like some sort of race condition, since it does work sometimes, and it nearly always (if not always) works after wiping the EPHEMERAL volume and rebooting.
Suspecting that the problem may be due to something on Hetzner's end not being ready, I've attempted to work around this by adding more delay to GRUB, but even with several minutes of delay this issue still occurs.
In my case these nodes are provisioned with Cluster API using CACPPT and CAPI, the former of which forcibly removes improperly named etcd members, but bypassing this behavior does not appear to solve the problem, instead leaving the node as a healthy etcd member with the wrong name and unable to otherwise function as a control plane.
Nodes seem to fail to fetch the hostname on the first attempt:
They then set a generated hostname and finally set the correct hostname from Hetzner:
In this example the timestamps are very close, presumably because this is not from a faulty node; however, I've seen both of these events occur in that order on every faulty node.
It seems like the solution would be to hold off on starting services until the Hetzner integration is finished; however, I've not been able to find the code responsible for this, so I'm not too sure how this works or what is even responsible for getting this information.
Environment
Not sure exactly at what version I started seeing this problem (it happens on at least 4/5 new control plane nodes), but it's definitely been happening on v1.8.x and v1.9.x, possibly also v1.7.x.