Services started before node identity is established on Hetzner Cloud #10105

echozio · 2025-01-09T10:26:38Z

Bug Report

Description

It appears that when the hostname is not set through the TalosConfig, Talos starts services before the final hostname is set, causing control plane nodes to join etcd with the wrong hostname and never reach a healthy state.

This seems like some sort of race condition, since it does work sometimes, and it nearly always (if not always) works after wiping the EPHEMERAL volume and rebooting.

Suspecting that the problem may be due to something on Hetzner's end not being ready, I've attempted to work around this by adding more delay to GRUB, but even with several minutes of delay this issue still occurs.

In my case these nodes are provisioned with Cluster API using CACPPT and CAPI, the former of which forcibly removes improperly named etcd members, but bypassing this behavior does not appear to solve the problem, instead leaving the node as a healthy etcd member with the wrong name and unable to otherwise function as a control plane.

Nodes seem to fail to fetch the hostname on the first attempt:

user: warning: [2025-01-06T02:06:31.200359456Z]: [talos] retrying error: failed to download config from "http://169.254.169.254/hetzner/v1/metadata/hostname": Get "http://169.254.169.254/hetzner/v1/metadata/hostname": dial tcp 169.254.169.254:80: connect: network is unreachable

They then set a generated hostname and finally the correct hostname from Hetzner:

user: warning: [2025-01-06T02:06:42.596664456Z]: [talos] setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "talos-8qy-c5p", "domainname": ""}
user: warning: [2025-01-06T02:06:42.620658456Z]: [talos] setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "example-controlplane-3d3d26d1-69c5k", "domainname": ""}

In this example the timestamps are very close, presumably because this is not from a faulty node. However, I've seen both of these events occur in that order on every faulty node.

It seems like the solution would be to hold off on starting services until the Hetzner integration has finished. However, I've not been able to find the code responsible for this, so I'm not sure how this works or which component is responsible for getting this information.

Environment

Talos version: v1.9.1
Kubernetes version: v1.32.0
Platform: Hetzner Cloud

Not sure exactly at what version I started seeing this problem (it happens on at least 4/5 new control plane nodes), but it's definitely been happening on v1.8.x and v1.9.x, possibly also v1.7.x.

smira · 2025-01-09T10:45:08Z

CACPPT is not receiving lots of support, Omni is a better solution in general (and doesn't care about hostnames).

But still, please submit full boot logs of a node in a failed state to understand the issue better.

echozio · 2025-01-09T11:03:28Z

I don't think the issue is directly related to the use of CACPPT, but here are the logs: https://gist.github.com/echozio/4e15617f276da492c40d61ae832fa907

As you can see, it fetches the hcloud network config and sets the correct hostname after starting a number of services (etcd among others):

10.0.0.13: user: warning: [2025-01-09T10:54:41.36885272Z]: [talos] phase startEverything (15/15): 1 tasks(s)
10.0.0.13: user: warning: [2025-01-09T10:54:41.36886672Z]: [talos] task startAllServices (1/1): starting
10.0.0.13: user: warning: [2025-01-09T10:54:41.36895772Z]: [talos] service[cri](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.36897672Z]: [talos] service[cri](Waiting): Waiting for network
10.0.0.13: user: warning: [2025-01-09T10:54:41.36907872Z]: [talos] service[cri](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:41.37716672Z]: [talos] service[cri](Preparing): Creating service runner
10.0.0.13: user: warning: [2025-01-09T10:54:41.37838672Z]: [talos] service[trustd](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.37904072Z]: [talos] service[trustd](Waiting): Waiting for service "containerd" to be "up", time sync, network
10.0.0.13: user: warning: [2025-01-09T10:54:41.38030772Z]: [talos] service[etcd](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.38117772Z]: [talos] service[etcd](Waiting): Waiting for service "cri" to be "up", time sync, network, etcd spec
10.0.0.13: user: warning: [2025-01-09T10:54:41.38272172Z]: [talos] task startAllServices (1/1): waiting for 11 services
10.0.0.13: user: warning: [2025-01-09T10:54:41.38355672Z]: [talos] service[trustd](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:41.38467572Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "auditd" to be "up", service "containerd" to be "up", service "cri" to be "up", service "dashboard" to be "up", service "etcd" to be "up", service "kubelet" to be "up", service "machined" to be "up", service "syslogd" to be "up", service "trustd" to be "up", service "udevd" to be "up"
10.0.0.13: user: warning: [2025-01-09T10:54:41.38981972Z]: [talos] service[trustd](Preparing): Creating service runner
10.0.0.13: user: warning: [2025-01-09T10:54:41.40532872Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 2306
10.0.0.13: user: warning: [2025-01-09T10:54:41.58448572Z]: [talos] service[apid](Running): Started task apid (PID 2376) for container apid
10.0.0.13: user: warning: [2025-01-09T10:54:41.59172572Z]: [talos] service[trustd](Running): Started task trustd (PID 2377) for container trustd
10.0.0.13: user: warning: [2025-01-09T10:54:41.74993472Z]: [talos] service[kubelet](Starting): Starting service
10.0.0.13: user: warning: [2025-01-09T10:54:41.75010872Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
10.0.0.13: user: warning: [2025-01-09T10:54:42.27609372Z]: [talos] service[apid](Running): Health check successful
10.0.0.13: user: warning: [2025-01-09T10:54:42.35354172Z]: [talos] service[etcd](Waiting): Waiting for service "cri" to be "up"
10.0.0.13: user: warning: [2025-01-09T10:54:42.35567772Z]: [talos] service[cri](Running): Health check successful
10.0.0.13: user: warning: [2025-01-09T10:54:42.35765472Z]: [talos] service[kubelet](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:42.35862372Z]: [talos] service[etcd](Preparing): Running pre state
10.0.0.13: user: warning: [2025-01-09T10:54:42.36975872Z]: [talos] service[trustd](Running): Health check successful
10.0.0.13: user: warning: [2025-01-09T10:54:42.58475872Z]: [talos] fetching hcloud network config from: "http://169.254.169.254/hetzner/v1/metadata/network-config"
10.0.0.13: user: warning: [2025-01-09T10:54:42.59128272Z]: [talos] setting hostname {"component": "controller-runtime", "controller": "network.HostnameSpecController", "hostname": "example-controlplane-3d3d26d1-ngr7t", "domainname": ""}
10.0.0.13: user: warning: [2025-01-09T10:54:42.59306572Z]: [talos] setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["1.1.1.1", "8.8.8.8"], "searchDomains": []}
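The ordering in the excerpt above can be checked mechanically. The following is a minimal sketch, assuming log lines in the format quoted in this thread; the function names are hypothetical, and treating `service[etcd](Preparing)` as the point where etcd starts using the current hostname is an assumption, not something confirmed by the Talos source.

```python
import re
from datetime import datetime

# Extracts the timestamp and message from a line such as:
# 10.0.0.13: user: warning: [2025-01-09T10:54:42.35862372Z]: [talos] service[etcd](Preparing): ...
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+)Z\]: \[talos\] (?P<msg>.*)"
)


def parse_ts(ts: str) -> datetime:
    # The logs carry up to 9 fractional digits; datetime keeps 6,
    # so pad with zeros on the right and truncate to microseconds.
    head, _, frac = ts.partition(".")
    frac = (frac + "000000")[:6]
    return datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")


def etcd_started_before_final_hostname(lines):
    """Return True if etcd entered Preparing before the last hostname change.

    Returns None when either event is missing from the log.
    """
    etcd_start = None
    last_hostname = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        ts = parse_ts(match["ts"])
        msg = match["msg"]
        if etcd_start is None and msg.startswith("service[etcd](Preparing)"):
            etcd_start = ts
        elif msg.startswith("setting hostname"):
            last_hostname = ts
    if etcd_start is None or last_hostname is None:
        return None
    return etcd_start < last_hostname
```

Feeding it the lines of a saved boot log (for example, `open("boot.log").read().splitlines()`, where `boot.log` is a placeholder file name) gives a quick yes/no answer for each node.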

smira · 2025-01-09T11:34:46Z

I'm not familiar enough with Hetzner Cloud, but it looks like the network got configured via DHCP, while the hostname was not sent via DHCP. Hetzner Cloud relies on the network being configured in order to download metadata, so that download only happens once the network is configured.

From Talos's point of view, once the network is configured, it is ready to start running.

So in this case I would say you need to set the hostname via machine configuration as a static one (at least for controlplane nodes), as etcd doesn't support changing member hostnames. Controlplane nodes should have a stable hostname.

The only way to fix this is to completely disable the Talos default hostname (so that it would wait for Hetzner to supply one), but we don't have this feature yet.
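As an illustration of the static-hostname approach, here is a minimal sketch that emits a per-node configuration patch setting `machine.network.hostname`. The helper name and the example hostname are hypothetical. The output is JSON, which is a subset of YAML; whether it can be applied unchanged as a patch in a given workflow is an assumption to verify.

```python
import json


def hostname_patch(hostname: str) -> str:
    """Build a machine config patch that pins the node's hostname."""
    if not hostname:
        raise ValueError("hostname must not be empty")
    # machine.network.hostname is the static hostname field
    # in the Talos machine configuration.
    patch = {"machine": {"network": {"hostname": hostname}}}
    return json.dumps(patch, indent=2)


# Hypothetical node name, for illustration only.
print(hostname_patch("example-controlplane-1"))
```

With a static hostname in the machine configuration, the node no longer depends on when the platform metadata becomes reachable.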

echozio · 2025-01-09T11:49:43Z

Shouldn't fetching this metadata from Hetzner be considered part of getting the network ready when talos.provider=hcloud is set? I'm not sure which component is responsible for declaring the network ready, but from looking at some of the code, it appears machined is responsible for fetching this metadata.

smira · 2025-01-09T11:53:45Z

It's hard to say whether we should block on this or not, as HCloud provider might be down, but if the network is up, why should we block? As long as we have enough information to proceed, we should proceed.

Disabling completely default hostname is a proper fix here (and up to you to enable this).

echozio · 2025-01-09T11:59:34Z

If disabling the generated default hostname will block service execution until a hostname is set, that would likely be a perfectly acceptable fix for my use case. I'm assuming it's machine.features.stableHostname=false in the TalosConfig you're referring to?

smira · 2025-01-09T12:00:34Z

There's no such feature in Talos (yet).

echozio · 2025-01-09T12:11:00Z

I see. What did you mean by this?

"Disabling completely default hostname is a proper fix here (and up to you to enable this)."

Setting the hostname in the configuration would also solve my issue, but I'd need to get CACPPT to play ball, as it's currently creating machines where the Machine's name != the HCloudMachine's name and it only supports setting the hostname from the Machine's name. This is something I'll investigate further if there isn't currently any feature in Talos I could leverage (e.g. disabling the default hostname).

smira · 2025-01-09T12:14:54Z

"Disabling completely default hostname is a proper fix here (and up to you to enable this)."

I mean that, in general, waiting for hcloud metadata doesn't make sense to me, whereas if you explicitly disabled the default hostname, Talos would wait for HCloud to provide one (or for any other hostname from any other source).
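The behavior being proposed can be modeled as a small decision function. This is only a toy sketch of the idea discussed here, not Talos code: the function, its parameters, and the `allow_default` switch are all hypothetical, and no such option exists in Talos at the time of this thread.

```python
from typing import Optional


def pick_hostname(
    configured: Optional[str],
    platform: Optional[str],
    default: str,
    allow_default: bool = True,
) -> Optional[str]:
    """Return the hostname to use now, or None to keep waiting.

    configured: hostname from machine configuration, if any
    platform:   hostname from platform metadata (e.g. HCloud), if fetched
    default:    generated fallback hostname
    """
    for candidate in (configured, platform):
        if candidate:
            return candidate
    # With the default allowed, startup proceeds under the generated name.
    # With it disabled, the caller keeps waiting for a real source.
    return default if allow_default else None


# Metadata not yet fetched: the generated name is used.
print(pick_hostname(None, None, "talos-8qy-c5p"))
# Default disabled: nothing to use yet, so services would wait.
print(pick_hostname(None, None, "talos-8qy-c5p", allow_default=False))
```

In this model the race described in the issue corresponds to the first call: services start under the generated name before the platform value arrives.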

echozio · 2025-01-09T12:22:07Z

I see, I agree that would be a good solution. I guess the "and up to you to enable this" led me to believe such a feature existed and I could just enable it.

Thanks a lot for your help. I'll update this issue if I'm able to solve this by other means. If disabling the default hostnames is planned or in the works I'd be happy to help out.

echozio · 2025-01-09T13:04:36Z

I made a small change to CACPPT and set spec.controlPlaneConfig.controlplane.hostname.source=MachineName on the TalosControlPlane, which appears to have solved the issue for me. I'll close this as the observed behavior (not waiting for metadata) is not a bug.

echozio closed this as completed Jan 9, 2025

echozio closed this as not planned Jan 9, 2025

This was referenced Jan 9, 2025

feat: create machines with the same name as their underlying infrastructure siderolabs/cluster-api-control-plane-provider-talos#207

Closed

feat: add InfrastructureName HostnameSource siderolabs/cluster-api-bootstrap-provider-talos#202

Merged

talos-bot closed this as completed in siderolabs/cluster-api-bootstrap-provider-talos#202 Jan 17, 2025
