etcd not showing a downed node as actually being down #16461
I have a 3-node etcd cluster. I was testing failure scenarios and stopped the etcd daemon on one of the three nodes. I verified that systemd did kill the process. However, I was surprised that when I ran 'member list', the node on which I had stopped the etcd service was still listed as 'started'. This is etcd 3.5.9. The log files from one of the peer etcd nodes do show "connection refused" errors for the downed node.

My expectation was that, upon detecting the absence of the downed node's daemon, the remaining peers would agree that it is down and change its status accordingly.

If I use 'etcdctl' to get the endpoint status and include the URL for the downed node, I get what I would expect (and yes, port 2479 is correct for this deployment):

Failed to get the status of endpoint etcd-node2:2479 (context deadline exceeded)

The timeout/wait values for this cluster:

grep -Ei "timeout|wait" etcd_14.conf.yml | grep -v ^#
election-timeout: 1000

What should I look for here for a clue as to what's happening?
Replies: 1 comment 1 reply
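'member list' simply lists the members of the cluster. The 'started' status there only means the member has joined the cluster, not that its process is currently running. You need to use 'endpoint health' to check the status of the nodes. Its output reports each endpoint as healthy or unhealthy.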
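To make the difference between membership and liveness concrete, here is a minimal sketch that runs both commands back to back. The endpoint list is an assumption: only etcd-node2:2479 appears in the question, so etcd-node1 and etcd-node3 are hypothetical names for the other two members, and the function name check_cluster is made up for illustration.

```shell
# Hypothetical endpoint list: only etcd-node2:2479 appears in the question;
# the other two host names are assumed for illustration.
ENDPOINTS="${ENDPOINTS:-etcd-node1:2479,etcd-node2:2479,etcd-node3:2479}"

# Show the configured membership, then probe each endpoint for liveness.
# The function returns the exit status of the health probe.
check_cluster() {
  echo "== member list (configured membership only) =="
  # A failure here should not hide the health probe below.
  etcdctl --endpoints="$ENDPOINTS" member list || true

  echo "== endpoint health (liveness probe) =="
  etcdctl --endpoints="$ENDPOINTS" endpoint health
}
```

Calling check_cluster with one daemon stopped should still show that member in the first section, while the second section should flag its endpoint as unhealthy. As far as I know, etcdctl also exits non-zero when any probed endpoint is unhealthy, which makes the function usable in scripts. If you would rather not maintain the endpoint list by hand, 'etcdctl endpoint health --cluster' takes the endpoints from the member list instead.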