Services started before node identity is established on Hetzner Cloud #10105

Closed
echozio opened this issue Jan 9, 2025 · 11 comments · Fixed by siderolabs/cluster-api-bootstrap-provider-talos#202


echozio commented Jan 9, 2025

Bug Report

Description

It appears that when the hostname is not set through the TalosConfig, Talos starts services before the final hostname is set, causing control plane nodes to join etcd with the wrong hostname and never reach a healthy state.

This looks like a race condition, since it does work sometimes, and it nearly always (if not always) works after wiping the EPHEMERAL volume and rebooting.

Suspecting that something on Hetzner's end might not be ready, I tried to work around this by adding more delay to GRUB, but even with several minutes of delay the issue still occurs.

In my case these nodes are provisioned with Cluster API using CACPPT and CAPI, the former of which forcibly removes improperly named etcd members, but bypassing this behavior does not appear to solve the problem, instead leaving the node as a healthy etcd member with the wrong name and unable to otherwise function as a control plane.

Nodes seem to fail to fetch the hostname on the first attempt:

user: warning: [2025-01-06T02:06:31.200359456Z]: [talos] retrying error: failed to download config from "http://169.254.169.254/hetzner/v1/metadata/hostname": Get "http://169.254.169.254/hetzner/v1/metadata/hostname": dial tcp 169.254.169.254:80: connect: network is unreachable

Then later setting a generated hostname and finally setting the correct hostname from Hetzner:

user: warning: [2025-01-06T02:06:42.596664456Z]: [talos] setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-8qy-c5p", "domainname": ""}
user: warning: [2025-01-06T02:06:42.620658456Z]: [talos] setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "example-controlplane-3d3d26d1-69c5k", "domainname": ""}

In this example the timestamps are very close, presumably because this is not from a faulty node; however, I've seen both of these events occur in that order on every faulty node.
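To check a boot log for the same pattern, the hostname events can be pulled out with a short script. This is only a sketch: the regular expression for generated hostnames is an assumption based on the single example above (`talos-8qy-c5p`), and the function names are made up for illustration.

```python
import json
import re

# Matches the "setting hostname" lines shown above; the JSON payload follows the message.
HOSTNAME_LINE = re.compile(r"\[talos\] setting hostname (\{.*\})\s*$")
# Assumption: generated default hostnames look like "talos-8qy-c5p".
GENERATED = re.compile(r"^talos-[a-z0-9]{3}-[a-z0-9]{3}$")


def hostname_sequence(log_lines):
    """Return the hostnames set by HostnameSpecController, in log order."""
    names = []
    for line in log_lines:
        match = HOSTNAME_LINE.search(line)
        if match:
            names.append(json.loads(match.group(1))["hostname"])
    return names


def replaced_generated_hostname(names):
    """Return True if a generated hostname was later replaced by a different one."""
    for index, name in enumerate(names):
        if GENERATED.match(name) and any(later != name for later in names[index + 1:]):
            return True
    return False


# The two log lines quoted above.
SAMPLE = [
    'user: warning: [2025-01-06T02:06:42.596664456Z]: [talos] setting hostname '
    '{"component": "controller-runtime", "controller": "network.HostnameSpecController", '
    '"hostname": "talos-8qy-c5p", "domainname": ""}',
    'user: warning: [2025-01-06T02:06:42.620658456Z]: [talos] setting hostname '
    '{"component": "controller-runtime", "controller": "network.HostnameSpecController", '
    '"hostname": "example-controlplane-3d3d26d1-69c5k", "domainname": ""}',
]

if __name__ == "__main__":
    names = hostname_sequence(SAMPLE)
    print(names, replaced_generated_hostname(names))
```

A log in which the platform hostname is the only one ever set would make `replaced_generated_hostname` return `False`.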

It seems like the solution would be to hold off on starting services until the Hetzner integration has finished. However, I've not been able to find the code responsible for this, so I'm not sure how it works or which component fetches this information.

Environment

  • Talos version: v1.9.1
  • Kubernetes version: v1.32.0
  • Platform: Hetzner Cloud

Not sure exactly at what version I started seeing this problem (it happens on at least 4/5 new control plane nodes), but it's definitely been happening on v1.8.x and v1.9.x, possibly also v1.7.x.

Member

smira commented Jan 9, 2025

CACPPT is not receiving lots of support, Omni is a better solution in general (and doesn't care about hostnames).

But still, please submit full boot logs of a node in a failed state to understand the issue better.

Author

echozio commented Jan 9, 2025

I don't think the issue is directly related to the use of CACPPT, but here are the logs: https://gist.github.com/echozio/4e15617f276da492c40d61ae832fa907

As you can see it fetches the hcloud network config and sets the correct hostname after starting a number of services (etcd among others):

10.0.0.13: user: warning: [2025-01-09T10:54:41.36885272Z]: [talos] phase startEverything (15/15): 1 tasks(s)
10.0.0.13: user: warning: [2025-01-09T10:54:41.36886672Z]: [talos] task startAllServices (1/1): starting
10.0.0.13: user: warning: [2025-01-09T10:54:41.36895772Z]: [talos] service[cri](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.36897672Z]: [talos] service[cri](Waiting): Waiting for network
10.0.0.13: user: warning: [2025-01-09T10:54:41.36907872Z]: [talos] service[cri](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:41.37716672Z]: [talos] service[cri](Preparing): Creating service runner
10.0.0.13: user: warning: [2025-01-09T10:54:41.37838672Z]: [talos] service[trustd](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.37904072Z]: [talos] service[trustd](Waiting): Waiting for service "containerd" to be "up", time sync, network
10.0.0.13: user: warning: [2025-01-09T10:54:41.38030772Z]: [talos] service[etcd](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.38117772Z]: [talos] service[etcd](Waiting): Waiting for service "cri" to be "up", time sync, network, etcd spec
10.0.0.13: user: warning: [2025-01-09T10:54:41.38272172Z]: [talos] task startAllServices (1/1): waiting for 11 services
10.0.0.13: user: warning: [2025-01-09T10:54:41.38355672Z]: [talos] service[trustd](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:41.38467572Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "auditd" to be "up", service "containerd" to be "up", service "cri" to be "up", service "dashboard" to be "up", service "etcd" to be "up", service "kubelet" to be "up", service "machined" to be "up", service "syslogd" to be "up", service "trustd" to be "up", service "udevd" to be "up"
10.0.0.13: user: warning: [2025-01-09T10:54:41.38981972Z]: [talos] service[trustd](Preparing): Creating service runner
10.0.0.13: user: warning: [2025-01-09T10:54:41.40532872Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 2306
10.0.0.13: user: warning: [2025-01-09T10:54:41.58448572Z]: [talos] service[apid](Running): Started task apid (PID 2376) for container apid
10.0.0.13: user: warning: [2025-01-09T10:54:41.59172572Z]: [talos] service[trustd](Running): Started task trustd (PID 2377) for container trustd
10.0.0.13: user: warning: [2025-01-09T10:54:41.74993472Z]: [talos] service[kubelet](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.75010872Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
10.0.0.13: user: warning: [2025-01-09T10:54:42.27609372Z]: [talos] service[apid](Running): Health check successful
10.0.0.13: user: warning: [2025-01-09T10:54:42.35354172Z]: [talos] service[etcd](Waiting): Waiting for service "cri" to be "up"
10.0.0.13: user: warning: [2025-01-09T10:54:42.35567772Z]: [talos] service[cri](Running): Health check successful
10.0.0.13: user: warning: [2025-01-09T10:54:42.35765472Z]: [talos] service[kubelet](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:42.35862372Z]: [talos] service[etcd](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:42.36975872Z]: [talos] service[trustd](Running): Health check successful
10.0.0.13: user: warning: [2025-01-09T10:54:42.58475872Z]: [talos] fetching hcloud network config from: "http://169.254.169.254/hetzner/v1/metadata/network-config"
10.0.0.13: user: warning: [2025-01-09T10:54:42.59128272Z]: [talos] setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "example-controlplane-3d3d26d1-ngr7t", "domainname": ""}
10.0.0.13: user: warning: [2025-01-09T10:54:42.59306572Z]: [talos] setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["1.1.1.1", "8.8.8.8"], "searchDomains": []}
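The ordering visible in this excerpt (etcd enters `Starting` before the last hostname is set) can be checked mechanically. The sketch below relies on line order rather than parsed timestamps and assumes the log is already in chronological order; the helper names are hypothetical.

```python
import re

# Extracts the bracketed timestamp for reporting only; ordering uses line position.
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\]")


def find_event(log_lines, needle, last=False):
    """Return (index, timestamp) of the first (or last) line containing needle, or None."""
    found = None
    for index, line in enumerate(log_lines):
        if needle in line:
            stamp = TIMESTAMP.search(line)
            found = (index, stamp.group(1) if stamp else None)
            if not last:
                break
    return found


def started_before_final_hostname(log_lines, service="etcd"):
    """Return True if the service entered Starting before the last hostname was set."""
    start = find_event(log_lines, f"service[{service}](Starting)")
    final = find_event(log_lines, "[talos] setting hostname", last=True)
    if start is None or final is None:
        return False
    return start[0] < final[0]


# Two lines taken from the excerpt above.
SAMPLE = [
    "10.0.0.13: user: warning: [2025-01-09T10:54:41.38030772Z]: [talos] "
    "service[etcd](Starting): Starting service",
    "10.0.0.13: user: warning: [2025-01-09T10:54:42.59128272Z]: [talos] setting hostname "
    '{"component": "controller-runtime", "controller": "network.HostnameSpecController", '
    '"hostname": "example-controlplane-3d3d26d1-ngr7t", "domainname": ""}',
]

if __name__ == "__main__":
    print(started_before_final_hostname(SAMPLE))
```

Passing `service="kubelet"` or another service name from the log applies the same check to that service.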

Member

smira commented Jan 9, 2025

I'm not familiar enough with Hetzner Cloud, but it looks like the network got configured via DHCP, while the hostname was not sent via DHCP. Hetzner Cloud relies on the network being configured to download metadata, so that download happens once the network is configured.

From Talos's point of view, once the network is configured, it is ready to start running.

So in this case I would say you need to set the hostname via machine configuration as a static one (at least for controlplane nodes), as etcd doesn't support changing member hostnames. Controlplane nodes should have a stable hostname.
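For illustration, a static hostname can be expressed as a small machine configuration patch. The sketch below assumes the `machine.network.hostname` field of the v1alpha1 machine configuration (worth verifying against the configuration reference for your Talos version); the hostname `example-controlplane-1` and the helper name are placeholders. The output is JSON, which is also valid YAML.

```python
import json


def static_hostname_patch(hostname):
    """Build a machine config patch that pins the node hostname.

    Assumes the v1alpha1 field machine.network.hostname.
    """
    if not hostname:
        raise ValueError("hostname must not be empty")
    return {"machine": {"network": {"hostname": hostname}}}


if __name__ == "__main__":
    # "example-controlplane-1" is a placeholder, not a name from this issue.
    print(json.dumps(static_hostname_patch("example-controlplane-1"), indent=2))
```

Each control plane node needs its own patch, since the point is that the name is fixed before etcd first joins.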

The only way to fix this is to completely disable the Talos default hostname (so that it would wait for Hetzner to supply one), but we don't have this feature yet.

Author

echozio commented Jan 9, 2025

Should not fetching this metadata from Hetzner be considered part of getting the network ready when talos.provider=hcloud is set? I'm not sure what component is responsible for declaring the network ready, but it appears machined is responsible for fetching this metadata based on looking at some of the code.

Member

smira commented Jan 9, 2025

It's hard to say whether we should block on this or not, as HCloud provider might be down, but if the network is up, why should we block? As long as we have enough information to proceed, we should proceed.

Disabling completely default hostname is a proper fix here (and up to you to enable this).

Author

echozio commented Jan 9, 2025

If disabling the generated default hostname will block service execution until a hostname is set that would likely be a perfectly acceptable fix for my use case. I'm assuming it's machine.features.stableHostname=false in the TalosConfig you're referring to?

Member

smira commented Jan 9, 2025

There's no such feature in Talos (yet).

Author

echozio commented Jan 9, 2025

I see. What did you mean by this?

> Disabling completely default hostname is a proper fix here (and up to you to enable this).

Setting the hostname in the configuration would also solve my issue, but I'd need to get CACPPT to play ball, as it's currently creating machines where the Machine's name != the HCloudMachine's name and it only supports setting the hostname from the Machine's name. This is something I'll investigate further if there isn't currently any feature in Talos I could leverage (e.g. disabling the default hostname).

Member

smira commented Jan 9, 2025

> Disabling completely default hostname is a proper fix here (and up to you to enable this).

I mean that in general waiting for hcloud metadata doesn't make sense to me, while if you explicitly disable default hostname, Talos would wait for HCloud to provide one (or any other hostname from any other source).
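The behavior described here can be sketched as a small decision function. This is not Talos code and the feature does not exist yet according to this thread; the function, its parameters, and the priority order (configured, then platform, then generated default) are all assumptions made purely to illustrate the idea.

```python
from typing import Optional


def effective_hostname(
    configured: Optional[str],
    platform: Optional[str],
    default_enabled: bool,
    generated: str,
) -> Optional[str]:
    """Pick the hostname to start services with, or None to keep waiting.

    Hypothetical model: with the default hostname enabled, a name is always
    available, so nothing blocks. With it disabled, the node waits until the
    machine config or the platform (for example HCloud metadata) supplies one.
    """
    if configured:
        return configured
    if platform:
        return platform
    if default_enabled:
        return generated
    return None  # no hostname yet: services should not start


if __name__ == "__main__":
    # Default enabled, metadata not fetched yet: proceeds with the generated name.
    print(effective_hostname(None, None, True, "talos-8qy-c5p"))
    # Default disabled, metadata not fetched yet: keeps waiting.
    print(effective_hostname(None, None, False, "talos-8qy-c5p"))
```

Under this model the race in the issue is the first case: the generated name satisfies the check before the platform name arrives.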

Author

echozio commented Jan 9, 2025

I see, I agree that would be a good solution. I guess the "and up to you to enable this" led me to believe such a feature existed and I could just enable it.

Thanks a lot for your help. I'll update this issue if I'm able to solve this by other means. If disabling the default hostnames is planned or in the works I'd be happy to help out.

Author

echozio commented Jan 9, 2025

I made a small change to CACPPT and set spec.controlPlaneConfig.controlplane.hostname.source=MachineName on the TalosControlPlane, which appears to have solved the issue for me. I'll close this as the observed behavior (not waiting for metadata) is not a bug.
