etcd not showing a downed node as actually being down #16461

daveisdigital · 2023-08-22T19:16:26Z

I have a 3-node etcd cluster.

I was testing failure scenarios and stopped the etcd daemon on 1 of the 3 nodes. I verified that systemd indeed killed the process.

However, I was surprised that when I ran 'member list', the node on which I had stopped the etcd service was still listed as 'started'.

This is etcd 3.5.9.

The log files from one of the peer etcd nodes do show "connection refused" errors for the downed node.

My expectation here was that upon detecting the absence of the downed node's daemon, the remaining peers would agree that it's down and change its status accordingly.

If I use 'etcdctl' to get the endpoint status and include the URL for the downed node, I get the error I would expect, shown below. And yes, port 2479 is correct for this deployment.

Failed to get the status of endpoint etcd-node2:2479 (context deadline exceeded)
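For a failure test like this one, it can help to probe each endpoint on its own so the unreachable node is reported by name. The following is only a sketch: `probe_endpoints` is a made-up helper name, the host names other than etcd-node2 are assumptions, and it uses `etcdctl endpoint health`, relying on etcdctl exiting non-zero when an endpoint is unhealthy or unreachable.

```shell
#!/bin/sh
# Probe each endpoint separately so that one unreachable node is reported
# by name instead of failing the whole command.
# probe_endpoints is a hypothetical helper; it relies on etcdctl returning
# a non-zero exit status for an unhealthy or unreachable endpoint.
probe_endpoints() {
    for ep in "$@"; do
        if etcdctl --endpoints="$ep" endpoint health >/dev/null 2>&1; then
            printf '%s up\n' "$ep"
        else
            printf '%s down\n' "$ep"
        fi
    done
}

# Example call (host names other than etcd-node2 are assumed):
# probe_endpoints etcd-node1:2479 etcd-node2:2479 etcd-node3:2479
```

With the daemon stopped on etcd-node2, the second line of output would be expected to read "etcd-node2:2479 down" once the command times out.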

The timeout/wait values for this cluster:

grep -Ei "timeout|wait" etcd_14.conf.yml | grep -v ^#

election-timeout: 1000
proxy-failure-wait: 5000
proxy-dial-timeout: 1000
proxy-write-timeout: 5000
proxy-read-timeout: 0

What should I look for here for a clue as to what's happening?

Answered by akuzia

akuzia · 2023-08-24T11:06:14Z

member list simply lists the members of the cluster. You need to use endpoint health to check the status of nodes. You will get something along these lines:

# etcdctl -w table endpoint health
+--------------------+--------+------------+---------------------------+
|      ENDPOINT      | HEALTH |    TOOK    |           ERROR           |
+--------------------+--------+------------+---------------------------+
| https://etcd1:2379 |   true | 1.711468ms |                           |
| https://etcd2:2379 |   true | 2.858825ms |                           |
| https://etcd3:2379 |  false |            | context deadline exceeded |
+--------------------+--------+------------+---------------------------+
Error: unhealthy cluster

In my case all of the etcd parameters are passed as environment variables, which is why none appear in the command above. You still need to provide the endpoints.
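As a rough sketch of one way to provide the endpoints through the environment: etcdctl v3 reads `ETCDCTL_ENDPOINTS` in place of the `--endpoints` flag. The helper name `join_endpoints` is made up for this example, and the scheme, port and host names are taken from the table above, so adjust them for your own deployment.

```shell
#!/bin/sh
# Build a comma-separated endpoint list from host names.
# join_endpoints is a hypothetical helper; scheme, port and hosts are
# assumptions copied from the sample table, not from a real cluster.
join_endpoints() {
    scheme=$1
    port=$2
    shift 2
    out=""
    for host in "$@"; do
        # Prepend a comma only when the list is already non-empty.
        out="${out:+$out,}${scheme}://${host}:${port}"
    done
    printf '%s\n' "$out"
}

ETCDCTL_ENDPOINTS=$(join_endpoints https 2379 etcd1 etcd2 etcd3)
export ETCDCTL_ENDPOINTS

# With the variable exported, the command from the answer needs no flags:
# etcdctl -w table endpoint health
```

If you would rather not maintain the list by hand, `etcdctl endpoint health --cluster` asks the cluster for its member list and checks every advertised client URL, though it still needs at least one reachable endpoint to start from.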

1 reply

jmhbnz Aug 26, 2023
Maintainer

Thanks @akuzia for the initial response. To add to this, the logic etcdctl member list uses to report status is extremely basic, see: https://github.com/etcd-io/etcd/blob/main/etcdctl/ctlv3/command/printer.go#L180-L183

As suggested above, for detailed health or status information please use etcdctl endpoint health or etcdctl endpoint status.
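If you want to act on that output in a test script, one option is to filter the table for rows whose HEALTH column is false. This is a sketch that assumes the table layout shown in the answer above; `unhealthy_endpoints` is a hypothetical helper name, not part of etcdctl.

```shell
#!/bin/sh
# Read `etcdctl -w table endpoint health` output on stdin and print the
# endpoints whose HEALTH column is "false".
# Data rows are split on '|': field 2 is ENDPOINT, field 3 is HEALTH.
# Border rows and the trailing "Error:" line contain no '|' and are skipped.
unhealthy_endpoints() {
    awk -F'|' 'NF >= 5 && $3 ~ /false/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the endpoint
        print $2
    }'
}

# Example call:
# etcdctl -w table endpoint health 2>/dev/null | unhealthy_endpoints
```

Parsing the human-readable table is fragile if the layout changes between versions, so treat this as a convenience for ad hoc testing rather than monitoring.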

You're welcome to raise a feature request or better yet a pull request if you would like to improve the output of etcdctl member list. Contributions are certainly welcomed!
