diff --git a/latest/bpg/book.adoc b/latest/bpg/book.adoc index 083ffaaba..3508306e3 100644 --- a/latest/bpg/book.adoc +++ b/latest/bpg/book.adoc @@ -46,6 +46,8 @@ include::cost/cost_optimization_index.adoc[leveloffset=+1] include::windows/index.adoc[leveloffset=+1] +include::hybrid/index.adoc[leveloffset=+1] + include::contribute.adoc[leveloffset=+1] diff --git a/latest/bpg/hybrid/index.adoc b/latest/bpg/hybrid/index.adoc new file mode 100644 index 000000000..b2f2abace --- /dev/null +++ b/latest/bpg/hybrid/index.adoc @@ -0,0 +1,24 @@ +//!!NODE_ROOT +[[hybrid,hybrid.title]] += Best Practices for Hybrid Deployments +:doctype: book +:sectnums: +:toc: left +:icons: font +:experimental: +:idprefix: +:idseparator: - +:sourcedir: . +:info_doctype: chapter +:info_title: Best Practices for Hybrid Deployments +:info_abstract: Best Practices for Hybrid Deployments +:info_titleabbrev: Hybrid +:imagesdir: images/hybrid/ + +This guide provides guidance on running deployments in on-premises or edge environments with EKS Hybrid Nodes or EKS Anywhere. + +We currently have published guides for the following topics: + +- xref:network-disconnections[Best Practices for EKS Hybrid Nodes and network disconnections] + +include::network-disconnections/index.adoc[leveloffset=+1] \ No newline at end of file diff --git a/latest/bpg/hybrid/network-disconnections/app-network-traffic.adoc b/latest/bpg/hybrid/network-disconnections/app-network-traffic.adoc new file mode 100644 index 000000000..560d386e3 --- /dev/null +++ b/latest/bpg/hybrid/network-disconnections/app-network-traffic.adoc @@ -0,0 +1,133 @@ +[.topic] +[[hybrid-nodes-app-network-traffic,hybrid-nodes-app-network-traffic.title]] += Application network traffic through network disconnections +:info_doctype: section +:info_title: Application network traffic through network disconnections +:info_titleabbrev: Application network traffic +:info_abstract: Application network traffic through network disconnections + +The topics on this page are related to Kubernetes cluster networking and the application traffic during network disconnections between nodes and the Kubernetes control plane. + +== Cilium + +Cilium has several modes for IP address management (IPAM), encapsulation, load balancing, and cluster routing. The modes validated in this guide used Cluster Scope IPAM, VXLAN overlay, BGP load balancing, and kube-proxy. Cilium was also used without BGP load balancing, replacing it with MetalLB L2 load balancing. + +The base of the Cilium install consists of the Cilium operator and Cilium agents. The Cilium operator runs as a Deployment and registers the Cilium Custom Resource Definitions (CRDs), manages IPAM, and synchronizes cluster objects with the Kubernetes API server among https://docs.cilium.io/en/stable/internals/cilium_operator/[other capabilities]. The Cilium agents run on each node as a DaemonSet and manages the eBPF programs to control the network rules for workloads running on the cluster. + +Generally, the in-cluster routing configured by Cilium remains available and in-place during network disconnections, which can be confirmed by observing the in-cluster traffic flows and ip table rules for the pod network. 
+ +[source,bash,subs="verbatim,attributes,quotes"] +---- +ip route show table all | grep cilium +---- + +[source,bash,subs="verbatim,attributes,quotes"] +---- +10.86.2.0/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 +10.86.2.64/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 +10.86.2.128/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 +10.86.2.192/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 mtu 1450 +10.86.3.0/26 via 10.86.3.16 dev cilium_host proto kernel src 10.86.3.16 +10.86.3.16 dev cilium_host proto kernel scope link +... +---- + +However, during network disconnections, the Cilium operator and Cilium agents restart due to the coupling of their health checks with the health of the connection with the Kubernetes API server. It is expected to see the following in the logs of the Cilium operator and Cilium agents during network disconnections. During the network disconnections, you can use tools such as crictl to observe the restarts of these components including their logs. + +[source,bash,subs="verbatim,attributes,quotes"] +---- +msg="Started gops server" address="127.0.0.1:9890" subsys=gops +msg="Establishing connection to apiserver" host="https://:443" subsys=k8s-client +msg="Establishing connection to apiserver" host="https://:443" subsys=k8s-client +msg="Unable to contact k8s api-server" error="Get \"https://:443/api/v1/namespaces/kube-system\": dial tcp :443: i/o timeout" ipAddr="https://:443" subsys=k8s-client +msg="Start hook failed" function="client.(*compositeClientset).onStart (agent.infra.k8s-client)" error="Get \"https://:443/api/v1/namespaces/kube-system\": dial tcp :443: i/o timeout" +msg="Start failed" error="Get \"https://:443/api/v1/namespaces/kube-system\": dial tcp :443: i/o timeout" duration=1m5.003834026s +msg=Stopping +msg="Stopped gops server" address="127.0.0.1:9890" subsys=gops +msg="failed to start: Get \"https://:443/api/v1/namespaces/kube-system\": dial tcp :443: i/o timeout" subsys=daemon +---- + +If you are using Cilium’s BGP Control Plane capability for application load balancing, the BGP session for your pods and services may be down during network disconnections because the BGP speaker functionality is integrated with the Cilium agent, and the Cilium agent will continuously restart when disconnected from the Kubernetes control plane. For more information, see the Cilium BGP Control Plane Operation Guide in the Cilium documentation. Additionally, if you experience a simultaneous failure during a network disconnection such as a power cycle or machine reboot, the Cilium routes will not be preserved through these actions, though the routes are recreated when the node reconnects to the Kubernetes control plane and Cilium starts up again. + +== Calico + +_Coming soon_ + +== MetalLB + +MetalLB has two modes for load balancing; https://metallb.universe.tf/concepts/layer2/[L2 mode] and https://metallb.universe.tf/concepts/bgp/[BGP mode]. Reference the MetalLB documentation for details of how these load balancing modes work as well as their limitations. The validation for this guide used MetalLB in L2 mode, where one machine in the cluster takes ownership of the Kubernetes Service, and uses ARP for IPv4 to make the load balancer IPs reachable on the local network. When running MetalLB there is a controller that is responsible for the IP assignment and speakers that run on each node which are responsible for advertising services with assigned IPs. 
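+
+For reference, a minimal MetalLB L2 configuration consists of an `IPAddressPool` and an `L2Advertisement` resource, as in the following sketch. The pool name and address range shown here are placeholders; use addresses from your local on-premises network.
+
+[source,yaml,subs="verbatim,attributes,quotes"]
+----
+apiVersion: metallb.io/v1beta1
+kind: IPAddressPool
+metadata:
+  name: example-pool          # placeholder name
+  namespace: metallb-system
+spec:
+  addresses:
+  - 10.86.5.200-10.86.5.210   # placeholder range on the local network
+---
+apiVersion: metallb.io/v1beta1
+kind: L2Advertisement
+metadata:
+  name: example-l2            # placeholder name
+  namespace: metallb-system
+spec:
+  ipAddressPools:
+  - example-pool
+----
+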
The MetalLB controller runs as a Deployment and the MetalLB speakers run as a DaemonSet. During network disconnections, the MetalLB controller and speakers will fail to watch the Kubernetes API server for cluster resources but continue running. Most importantly, the Services that are using MetalLB for external connectivity remain available and accessible during network disconnections. + +== kube-proxy + +In EKS clusters, kube-proxy runs as a DaemonSet on each node and is responsible for managing network rules to enable communication between services and pods by translating service IP addresses to the IP addresses of the underlying pods. The iptables rules configured by kube-proxy are maintained during network disconnections and in-cluster routing continues to function and the kube-proxy pods continue to run. + +You can observe the kube-proxy rules with the following iptables commands. The first command shows packets going through the `PREROUTING` chain get directed to the `KUBE-SERVICES` chain. + +[source,bash,subs="verbatim,attributes,quotes"] +---- +iptables -t nat -L PREROUTING +---- + +[source,bash,subs="verbatim,attributes,quotes"] +---- +Chain PREROUTING (policy ACCEPT) +target prot opt source destination +KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */ +---- + +Inspecting the `KUBE-SERVICES` chain we can see the rules for the various cluster services. + +[source,bash,subs="verbatim,attributes,quotes"] +---- +Chain KUBE-SERVICES (2 references) +target prot opt source destination +KUBE-SVL-NZTS37XDTDNXGCKJ tcp -- anywhere 172.16.189.136 /* kube-system/hubble-peer:peer-service cluster IP */ +KUBE-SVC-2BINP2AXJOTI3HJ5 tcp -- anywhere 172.16.62.72 /* default/metallb-webhook-service cluster IP */ +KUBE-SVC-LRNEBRA3Z5YGJ4QC tcp -- anywhere 172.16.145.111 /* default/redis-leader cluster IP */ +KUBE-SVC-I7SKRZYQ7PWYV5X7 tcp -- anywhere 172.16.142.147 /* kube-system/eks-extension-metrics-api:metrics-api cluster IP */ +KUBE-SVC-JD5MR3NA4I4DYORP tcp -- anywhere 172.16.0.10 /* kube-system/kube-dns:metrics cluster IP */ +KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 172.16.0.10 /* kube-system/kube-dns:dns cluster IP */ +KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 172.16.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ +KUBE-SVC-ENODL3HWJ5BZY56Q tcp -- anywhere 172.16.7.26 /* default/frontend cluster IP */ +KUBE-EXT-ENODL3HWJ5BZY56Q tcp -- anywhere /* default/frontend loadbalancer IP */ +KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 172.16.0.1 /* default/kubernetes:https cluster IP */ +KUBE-SVC-YU5RV2YQWHLZ5XPR tcp -- anywhere 172.16.228.76 /* default/redis-follower cluster IP */ +KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ +---- + +Inspecting the chain of the frontend service for the application we can see the pod IP addresses backing the service. 
+ +[source,bash,subs="verbatim,attributes,quotes"] +---- +iptables -t nat -L KUBE-SVC-ENODL3HWJ5BZY56Q +---- + +[source,bash,subs="verbatim,attributes,quotes"] +---- +Chain KUBE-SVC-ENODL3HWJ5BZY56Q (2 references) +target prot opt source destination +KUBE-SEP-EKXE7ASH7Y74BGBO all -- anywhere anywhere /* default/frontend -> 10.86.2.103:80 */ statistic mode random probability 0.33333333349 +KUBE-SEP-GCY3OUXWSVMSEAR6 all -- anywhere anywhere /* default/frontend -> 10.86.2.179:80 */ statistic mode random probability 0.50000000000 +KUBE-SEP-6GJJR3EF5AUP2WBU all -- anywhere anywhere /* default/frontend -> 10.86.3.47:80 */ +---- + +The following kube-proxy log messages are expected during network disconnections as it attempts to watch the Kubernetes API server for updates to node and endpoint resources. + +[source,bash,subs="verbatim,attributes,quotes"] +---- +"Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.Node: failed to list *v1.Node: Get \"https:///api/v1/nodes?fieldSelector=metadata.name%3D&resourceVersion=2241908\": dial tcp :443: i/o timeout" logger="UnhandledError" +"Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get \"https:///apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=2242090\": dial tcp :443: i/o timeout" logger="UnhandledError" +---- + +== CoreDNS + +By default, pods in EKS clusters use the CoreDNS cluster IP address as the name server for in-cluster DNS queries. In EKS clusters, CoreDNS runs as a deployment on nodes. With hybrid nodes, pods are able to continue communicating with the CoreDNS during network disconnections when there are CoreDNS replicas running locally on hybrid nodes. If you have an EKS cluster with nodes in the cloud and hybrid nodes in your on-premises environment, it is recommended to have at least 1 CoreDNS replica in each environment. CoreDNS continues serving DNS queries for records that were created before the network disconnection and continues running through the network reconnection for static stability. + +The following CoreDNS log messages are expected during network disconnections as it attempts to list objects from the Kubernetes API server. 
+ +[source,bash,subs="verbatim,attributes,quotes"] +---- +Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://:443/api/v1/namespaces?resourceVersion=2263964": dial tcp :443: i/o timeout +Failed to watch *v1.Service: failed to list *v1.Service: Get "https://:443/api/v1/services?resourceVersion=2263966": dial tcp :443: i/o timeout +Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://:443/apis/discovery.k8s.io/v1/endpointslices?resourceVersion=2263896": dial tcp : i/o timeout +---- \ No newline at end of file diff --git a/latest/bpg/hybrid/network-disconnections/best-practices.adoc b/latest/bpg/hybrid/network-disconnections/best-practices.adoc new file mode 100644 index 000000000..e827a69c5 --- /dev/null +++ b/latest/bpg/hybrid/network-disconnections/best-practices.adoc @@ -0,0 +1,137 @@ +[.topic] +[[hybrid-nodes-network-disconnection-best-practices,hybrid-nodes-network-disconnection-best-practices.title]] += Best practices for stability through network disconnections +:info_doctype: section +:info_title: Best practices for stability through network disconnections +:info_titleabbrev: Best practices +:info_abstract: Best practices for stability through network disconnections + + +== Highly available networking + +The best thing you can do to avoid network disconnections between hybrid nodes and the Kubernetes control plane is to have redundant, resilient connections from your on-premises environment to/from AWS. Reference the https://docs.aws.amazon.com/directconnect/latest/UserGuide/resiliency_toolkit.html[AWS Direct Connect Resiliency Toolkit] and https://docs.aws.amazon.com/vpn/latest/s2svpn/vpn-redundant-connection.html[AWS Site-to-Site VPN documentation] for more information on architecting for highly available hybrid networks with those solutions. + +== Highly available applications + +When architecting applications, consider your failure domains and the effects of different types of outages. Kubernetes has built-in mechanisms to deploy and maintain application replicas across node, zone, and regional domains. The usage of these mechanisms depends on your application architecture, environments, and availability requirements. For example, stateless applications can often be deployed with multiple replicas and can move across arbitrary hosts and infrastructure capacity, and you can use node selectors and topology spread constraints to run instances of the application across different domains. For details of application-level techniques you can use to build resilient applications on Kubernetes, reference the https://aws.github.io/aws-eks-best-practices/reliability/docs/application/[EKS Best Practices Guide]. + +Kubernetes evaluates zonal information for nodes when they are disconnected from the Kubernetes control plane while determining whether to move pods to other nodes. If all nodes in a zone are unreachable, Kubernetes cancels pod evictions for the nodes in that zone. As a best practice, if you have a deployment with nodes running in multiple data centers or physical locations, you should assign a zone to each node based on the data center or physical location where the node is running. When you run EKS with nodes in the cloud, this zone label is automatically applied by the AWS cloud-controller-manager. However, a cloud-controller-manager is not used with hybrid nodes, so you can pass this information via your kubelet configuration. An example of how to configure a zone in your node configuration for hybrid nodes is shown below. 
The configuration below is passed when you connect your hybrid nodes to your cluster with the hybrid nodes CLI (`nodeadm`). For more information on the `topology.kubernetes.io/zone` label, see the https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone[Kubernetes documentation]. For more information on the hybrid nodes CLI, see the https://docs.aws.amazon.com/eks/latest/userguide/hybrid-nodes-nodeadm.html[Hybrid Nodes nodeadm reference]. + +[source,yaml,subs="verbatim,attributes,quotes"] +---- +apiVersion: node.eks.aws/v1alpha1 +kind: NodeConfig +spec: + cluster: + name: my-cluster + region: my-region + kubelet: + flags: + - --node-labels=topology.kubernetes.io/zone=dc1 + hybrid: + ... +---- + +== Network monitoring + +If you are using AWS Direct Connect or AWS Site-to-Site VPN for your hybrid connectivity, you can use CloudWatch alarms, logs, and metrics to observe the state of your hybrid connection and diagnose issues. For more information, see https://docs.aws.amazon.com/directconnect/latest/UserGuide/monitoring-overview.html[Monitoring AWS Direct Connect resources] and https://docs.aws.amazon.com/vpn/latest/s2svpn/monitoring-overview-vpn.html[Monitor an AWS Site-to-Site VPN connection]. + +It is recommended to create alarms for `NodeNotReady` events reported by the node-lifecycle-controller running on the EKS control plane, which signals that a hybrid node may be experiencing a network disconnection. You can create this alarm by enabling EKS control plane logging for the Controller Manager and creating a Metric Filter in CloudWatch for the “Recording status change event message for node” message with the status=“NodeNotReady”. After creating a Metric Filter, you can create an alarm for this filter based on your desired thresholds. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Alarm-On-Logs.html[Alarming for logs in the CloudWatch documentation] for more information. + +You can use the Transit Gateway (TGW) and Virtual Private Gateway (VGW) built-in metrics to observe the network traffic into and out of your TGW or VPW. You can create alarms for these metrics to detect scenarios when network traffic dips below normal levels, signaling a network issue between hybrid nodes and the EKS control plane. The TGW and VGW metrics are described in the table below. + +[cols="2,1,5"] +|=== +|Gateway|Metric|Description + +|Transit Gateway +|BytesIn +|The bytes received by TGW from the attachment (EKS control plane to hybrid nodes) + +|Transit Gateway +|BytesOut +|The bytes sent from TGW to the attachment (hybrid nodes to EKS control plane) + +|Virtual Private Gateway +|TunnelDataIn +|The bytes sent from the AWS side of the connection through the VPN tunnel to the customer gateway (EKS control plane to hybrid nodes) + +|Virtual Private Gateway +|TunnelDataOut +|The bytes received on the AWS side of the connection through the VPN tunnel from a customer gateway (hybrid nodes to EKS control plane) +|=== + +You can also use https://aws.amazon.com/blogs/networking-and-content-delivery/monitor-hybrid-connectivity-with-amazon-cloudwatch-network-monitor/[CloudWatch Network Monitor] to gain further insight into your hybrid connections to reduce mean time to recovery and determine if network issues are in AWS or your environment. CloudWatch Network Monitor can be used to visualize packet loss and latency of your hybrid network connections, set alerts and thresholds, and then take action to improve your network experience. 
For more information, see https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/what-is-network-monitor.html[Using Amazon CloudWatch Network Monitor]. + +EKS offers several options for monitoring the health of your clusters and applications. For cluster health you can use the observability dashboard in the EKS console to quickly detect, troubleshoot, and remediate issues. You can also use Amazon Managed Service for Prometheus, AWS Distro for Open Telemetry (ADOT), and CloudWatch for cluster, application, and infrastructure monitoring. For more information on the observability options in EKS, see https://docs.aws.amazon.com/eks/latest/userguide/eks-observe.html[Monitor your cluster performance and view logs]. + +== Local troubleshooting + +To prepare for network disconnections between hybrid nodes and the EKS control plane, you can set up secondary monitoring and logging backends to enable continuous observability for applications when regional AWS services are not available. For example, this can be accomplished with the AWS Distro for Open Telemetry (ADOT) collector which can be configured to send metrics and logs to multiple backends. You can also use local tools such as `crictl` to interact locally with pods and containers as a replacement for `kubectl` or other Kubernetes API-compatible clients that typically query through the Kubernetes API server endpoint. For more information on `crictl`, see the https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md[`crictl` documentation] in the cri-tools GitHub. A few useful crictl commands are listed below. + +List pods running on the host + +[source,bash,subs="verbatim,attributes,quotes"] +---- +crictl pods +---- + +List containers running on the host + +[source,bash,subs="verbatim,attributes,quotes"] +---- +crictl ps +---- + +List images running on the host + +[source,bash,subs="verbatim,attributes,quotes"] +---- +crictl images +---- + +Get logs of a container running on the host + +[source,bash,subs="verbatim,attributes,quotes"] +---- +crictl logs CONTAINER_NAME +---- + +Get statistics of pods running on the host +[source,bash,subs="verbatim,attributes,quotes"] +---- +crictl statsp +---- + +== Application network traffic + +When using hybrid nodes, it is important to consider and understand the network flows of your application traffic and the technologies you are using to expose your applications externally to your cluster. Different technologies for application load balancing and ingress behave differently during network disconnections. For example, if you are using Cilium’s BGP Control Plane capability for application load balancing, the BGP session for your pods and services may be down during network disconnections because the BGP speaker functionality is integrated with the Cilium agent, and the Cilium agent will continuously restart when disconnected from the Kubernetes control plane. The reason for the restart is due to Cilium’s health check failing because its health is coupled with access to the Kubernetes control plane (see https://github.com/cilium/cilium/issues/31702[CFP: #31702] with opt-in improvement in Cilium v1.17). Similarly, if you are using Application Load Balancers (ALB) or Network Load Balancers (NLB) for AWS Region-originated application traffic, this traffic be temporarily down if your on-premises environment loses connectivity to the AWS Region. 
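+
+If you use Cilium's BGP Control Plane, the peering is typically declared through Cilium's BGP custom resources, for example a `CiliumBGPPeeringPolicy` similar to the sketch below (the exact CRDs vary by Cilium version). The ASNs, peer address, and node selector label are placeholders for your environment; the relevant point is that this advertisement path depends on the Cilium agent, which restarts during disconnections as described above.
+
+[source,yaml,subs="verbatim,attributes,quotes"]
+----
+apiVersion: cilium.io/v2alpha1
+kind: CiliumBGPPeeringPolicy
+metadata:
+  name: example-bgp-policy        # placeholder name
+spec:
+  nodeSelector:
+    matchLabels:
+      bgp: enabled                # placeholder node label
+  virtualRouters:
+  - localASN: 64512               # placeholder ASN for the hybrid nodes
+    exportPodCIDR: true
+    neighbors:
+    - peerAddress: 10.86.1.1/32   # placeholder address of the on-premises router
+      peerASN: 64513              # placeholder ASN of the on-premises router
+----
+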
It is recommended to validate the technologies you are using for load balancing and ingress remain stable during network disconnections before deploying to production to avoid unexpected behaviors. The example in the https://github.com/aws-samples/eks-hybrid-examples[aws-samples/eks-hybrid-examples] GitHub repo uses MetalLB for load balancing in https://metallb.universe.tf/concepts/layer2/[L2 mode], which remains stable during network disconnections between hybrid nodes and the EKS control plane. + +== Review dependencies on remote AWS services + +When using hybrid nodes, you should be aware of and intentional about the dependencies you take on regional AWS services that are external to your on-premises or edge environment. Examples include accessing S3 or RDS for application data, using Amazon Managed Service for Prometheus or CloudWatch for metrics and logs, using Application and Network Load Balancers for region-originated traffic, and pulling containers from Elastic Container Registry. These services will not be accessible during network disconnections between your on-premises environment and AWS. If your on-premises environment is highly susceptible to network disconnections with AWS, you should review your use of AWS services and ensure that losing a connection to other AWS services does not impact the static stability of your applications. + +== Tune Kubernetes pod failover behavior + +There are options to tune pod failover behavior during network disconnections for applications that are not portable across hosts or for resource-constrained environments that do not have spare capacity for pod failover. Generally, it is important to consider the resource requirements of your applications and to have enough capacity for one or more instances of the application to failover to a different host in the event of node failure. + +- [.underline]#Option 1 - Use DaemonSets#: This option applies for applications that can and should run on all nodes in the cluster. DaemonSets are automatically configured to tolerate the unreachable taint, which binds the DaemonSet pods to nodes through network disconnections. +- [.underline]#Option 2 - Tune `tolerationSeconds` for unreachable taint#: You can tune the amount of time your pods remain bound to nodes in the event of network disconnections. You can do this by configuring application pods to tolerate the unreachable taint with `NoExecute` effect for a duration you configure (`tolerationSeconds` in application spec). With option, when there are network disconnections, your application pods remain bound to nodes until `tolerationSeconds` expires. This option should be considered with care, as increasing the `tolerationSeconds` for the unreachable taint with `NoExecute` means that pods running on unreachable hosts may take longer to be moved to other reachable, healthy hosts. +- [.underline]#Option 3: Custom controller#: You can create and run a custom controller (or other software) that monitors Kubernetes for the unreachable taint with `NoExecute` effect. When this taint is detected, the custom controller can check application-specific metrics for the health of the application. If the application is healthy, the custom controller can remove the unreachable taint, which will negate the eviction of pods from nodes during network disconnections. + +An example of how to configure a Deployment with `tolerationSeconds` for the unreachable taint is shown below. 
In the example below, `tolerationSeconds` is set to `1800` (30 minutes), which means pods running on unreachable nodes will only be evicted if the network disconnection lasts for longer than 30 minutes. + +[source,yaml,subs="verbatim,attributes,quotes"] +---- +apiVersion: apps/v1 +kind: Deployment +metadata: +... +spec: +... + tolerations: + - key: "node.kubernetes.io/unreachable" + operator: "Exists" + effect: "NoExecute" + tolerationSeconds: 1800 +---- \ No newline at end of file diff --git a/latest/bpg/hybrid/network-disconnections/host-credentials.adoc b/latest/bpg/hybrid/network-disconnections/host-credentials.adoc new file mode 100644 index 000000000..12da2a3bd --- /dev/null +++ b/latest/bpg/hybrid/network-disconnections/host-credentials.adoc @@ -0,0 +1,65 @@ +[.topic] +[[hybrid-nodes-host-creds,hybrid-nodes-host-creds.title]] += Host credentials through network disconnections +:info_doctype: section +:info_title: Host credentials through network disconnections +:info_titleabbrev: Host credentials +:info_abstract: Host credentials through network disconnections + +EKS Hybrid Nodes is integrated with AWS Systems Manager (SSM) hybrid activations and AWS IAM Roles Anywhere for temporary IAM credentials that are used to authenticate the node with the EKS control plane. Both SSM and IAM Roles Anywhere have mechanisms for automatically refreshing the temporary credentials that it manages on the on-premises hosts. It is recommended to have a consistent credential provider across the hybrid nodes for your cluster, meaning use either SSM hybrid activations or IAM Roles Anywhere, but not both. + +== SSM hybrid activations + +The temporary credentials provisioned by SSM are valid for 1 hour. You cannot alter the credential validity duration when using SSM as your credential provider. The temporary credentials are automatically rotated by SSM before they expire and the rotation does not impact the status of your nodes or applications. However, when there are network disconnections between the SSM agent and the SSM regional endpoint, SSM is unable to refresh the credentials, and the credentials may expire. + +SSM uses an exponential backoff for credential refresh retry when it is unable to connect to the SSM regional endpoints. In SSM agent version `3.3.808.0` and later (released August 2024), the exponential backoff is capped at 30 minutes. Depending on the duration of your network disconnection, it may take up to 30 minutes for the SSM credentials to be refreshed, and hybrid nodes will not automatically reconnect to the EKS control plane until the credentials are refreshed. In this scenario, you can restart the SSM agent to force a credential refresh. A side effect of the current SSM credential refresh behavior is that nodes may reconnect at staggering times based on when the SSM agent running on the node is able to refresh its credentials. Because of this, you may see pod failover from nodes that are not yet reconnected to nodes that are already reconnected. + +Get SSM agent version. You can alternatively check the Fleet Manager section of the SSM console. 
+ +[source,bash,subs="verbatim,attributes,quotes"] +---- +# AL2023, RHEL +yum info amazon-ssm-agent + +# Ubuntu +snap list amazon-ssm-agent +---- + +Restart SSM agent + +[source,bash,subs="verbatim,attributes,quotes"] +---- +# AL2023, RHEL +systemctl restart amazon-ssm-agent + +# Ubuntu +systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent +---- + +View SSM agent logs + +[source,bash,subs="verbatim,attributes,quotes"] +---- +tail -f /var/log/amazon/ssm/amazon-ssm-agent.log +---- + +Expected log messages during network disconnections + +[source,bash,subs="verbatim,attributes,quotes"] +---- +INFO [CredentialRefresher] Credentials ready +INFO [CredentialRefresher] Next credential rotation will be in 29.995040663666668 minutes +ERROR [CredentialRefresher] Retrieve credentials produced error: RequestError: send request failed +INFO [CredentialRefresher] Sleeping for 35s before retrying retrieve credentials +ERROR [CredentialRefresher] Retrieve credentials produced error: RequestError: send request failed +INFO [CredentialRefresher] Sleeping for 56s before retrying retrieve credentials +ERROR [CredentialRefresher] Retrieve credentials produced error: RequestError: send request failed +INFO [CredentialRefresher] Sleeping for 1m24s before retrying retrieve credentials +---- + +== IAM Roles Anywhere + +The temporary credentials provisioned by IAM Roles Anywhere are valid for 1 hour by default. You can configure the credential validity duration with IAM Roles Anywhere through the https://docs.aws.amazon.com/rolesanywhere/latest/userguide/authentication-create-session.html#credentials-object[`durationSeconds`] field of your IAM Roles Anywhere profile. The maximum credential validity duration is 12 hours. The https://docs.aws.amazon.com/managedservices/latest/ctref/management-advanced-identity-and-access-management-iam-update-maxsessionduration.html[`MaxSessionDuration`] setting on your Hybrid Nodes IAM role must be greater than the `durationSeconds` setting on your IAM Roles Anywhere profile. + +When using IAM Roles Anywhere as the credential provider for your hybrid nodes, the reconnection to the EKS control plane after network disconnections generally occurs within seconds of the network restoration because the kubelet calls `aws_signing_helper credential-process` to get credentials on-demand. While not directly related to hybrid nodes and network disconnections, you can configure notifications and alerts for certificate expiry when you use IAM Roles Anywhere, see https://docs.aws.amazon.com/rolesanywhere/latest/userguide/customize-notification-settings.html[Customize notification settings in IAM Roles Anywhere]. + diff --git a/latest/bpg/hybrid/network-disconnections/index.adoc b/latest/bpg/hybrid/network-disconnections/index.adoc new file mode 100644 index 000000000..54200a329 --- /dev/null +++ b/latest/bpg/hybrid/network-disconnections/index.adoc @@ -0,0 +1,33 @@ +//!!NODE_ROOT
+[."topic"] +[[hybrid-nodes-network-disconnections,hybrid-nodes-network-disconnections.title]] += EKS Hybrid Nodes and network disconnections +:doctype: section +:sectnums: +:toc: left +:icons: font +:experimental: +:idprefix: +:idseparator: - +:sourcedir: . +:info_doctype: chapter +:info_title: EKS Hybrid Nodes and network disconnections +:info_abstract: EKS Hybrid Nodes and network disconnections +:info_titleabbrev: Hybrid-Nodes-Network-Disconnections +:imagesdir: images/hybrid/ + +The EKS Hybrid Nodes architecture can be new to customers who are accustomed to running local Kubernetes clusters entirely in their own data centers or edge locations. With EKS Hybrid Nodes, the Kubernetes control plane runs in an AWS Region and only the nodes run on-premises, resulting in a “stretched” or “extended” Kubernetes cluster architecture. + +This leads to a common question, “What happens if my nodes get disconnected from the Kubernetes control plane?” + +In this guide, we answer that question through a review of the following topics. It is recommended to validate the stability and reliability of your applications through network disconnections as each application may behave differently based on its dependencies, configuration, and environment. See the aws-samples/eks-hybrid-examples GitHub repo for test setup, procedures, and results you can reference to test network disconnections with EKS Hybrid Nodes and your own applications. The GitHub repo also contains additional details of the tests used to validate the behavior explained in this guide. + +- xref:hybrid-nodes-network-disconnection-best-practices[Best practices for stability through network disconnections] +- xref:hybrid-nodes-kubernetes-pod-failover[Kubernetes pod failover behavior through network disconnections] +- xref:hybrid-nodes-app-network-traffic[Application network traffic through network disconnections] +- xref:hybrid-nodes-host-creds[Host credentials through network disconnections] + +include::best-practices.adoc[leveloffset=+1] +include::kubernetes-pod-failover.adoc[leveloffset=+1] +include::app-network-traffic.adoc[leveloffset=+1] +include::host-credentials.adoc[leveloffset=+1] \ No newline at end of file diff --git a/latest/bpg/hybrid/network-disconnections/kubernetes-pod-failover.adoc b/latest/bpg/hybrid/network-disconnections/kubernetes-pod-failover.adoc new file mode 100644 index 000000000..29bfc3096 --- /dev/null +++ b/latest/bpg/hybrid/network-disconnections/kubernetes-pod-failover.adoc @@ -0,0 +1,136 @@ +[.topic] +[[hybrid-nodes-kubernetes-pod-failover,hybrid-nodes-kubernetes-pod-failover.title]] += Kubernetes pod failover through network disconnections +:info_doctype: section +:info_title: Kubernetes pod failover through network disconnections +:info_titleabbrev: Kubernetes pod failover +:info_abstract: Kubernetes pod failover through network disconnections + +We start with a review of the key concepts, components, and settings that play a role in how Kubernetes behaves during network disconnections between nodes and the Kubernetes control plane. EKS is upstream Kubernetes conformant, so all of the Kubernetes concepts, components, and settings detailed in this section apply to EKS and EKS Hybrid Nodes deployments. + +== Concepts + +[.underline]#Taints and Tolerations#: Taints and tolerations are used in Kubernetes to control the scheduling of pods onto nodes. Taints are used by the node-lifecycle-controller to indicate nodes that are not eligible for scheduling and nodes that should evict the pods running on them. 
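+
+You can view the taints currently applied to a node with a command such as the following; the node name is a placeholder. During a network disconnection, you would expect the `node.kubernetes.io/unreachable` taint described next to appear in the output.
+
+[source,bash,subs="verbatim,attributes,quotes"]
+----
+kubectl get node my-hybrid-node -o jsonpath='{.spec.taints}'
+----
+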
When nodes are unreachable due to a network disconnection, the node.kubernetes.io/unreachable taint is applied by the node-lifecycle-controller with a NoSchedule effect and with a NoExecute effect if certain conditions are met. The node.kubernetes.io/unreachable taint applied by the node-lifecycle-controller corresponds to the NodeCondition Ready being Unknown. Tolerations for taints can be specified by users at the application level in the PodSpec. + +* NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted. +* NoExecute: Pods that do not tolerate the taint are evicted immediately. Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever. Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node. + +[.underline]#Node Leases#: Kubernetes uses the Lease API to communicate kubelet node heartbeats to the Kubernetes API server. For every node, there is a Lease object with a matching name. Under the hood, every kubelet heartbeat is an update request to this Lease object, updating the spec.renewTime field for the Lease. The Kubernetes control plane uses the timestamp of this field to determine node availability. In the case of network disconnections between nodes and the Kubernetes control plane, nodes are unable to update the spec.renewTime of their leases, which is interpreted to signal that the NodeCondition Ready is Unknown by the Kubernetes control plane. + +== Components + +image::images/hybrid/k8s-components-pod-failover.png[Kubernetes components involved in pod failover behavior,scaledwidth=100%] + +[cols="2,2,5"] +|=== +|Component|Sub-component|Description + +|Kubernetes control plane +|kube-api-server +|The API server is a component of the Kubernetes control plane that exposes the Kubernetes API. + +|Kubernetes control plane +|node-lifecycle-controller +|One of the controllers that the kube-controller-manager runs. Responsible for noticing and responding to issues with nodes. + +|Kubernetes control plane +|kube-scheduler +|Control plane component that watches for newly created Pods with no assigned node, and selects a node for them to run on. + +|Kubernetes nodes +|kubelet +|An agent that runs on each node in the cluster. The kubelet takes a set of PodSpecs and ensures the containers described in those PodSpecs are running and healthy. +|=== + +== Configuration settings + +[cols="1,2,5,1,1,1"] +|=== +|Component|Setting|Description|K8s default|EKS default|Configurable in EKS + +|kube-api-server +|default-unreachable-toleration-seconds +|Indicates the `tolerationSeconds` of the toleration for `unreachable:NoExecute` that is added by default to every pod that does not already have such a toleration. +|300 +|300 +|No + +|node-lifecycle-controller +|node-monitor-grace-period +|Amount of time a node can be unresponsive before marking it unhealthy. Must be N times more than kubelet's `nodeStatusUpdateFrequency`, where N means number of retries allowed for kubelet to post node status. +|40 +|40 +|No + +|node-lifecycle-controller +|large-cluster-size-threshold +|Number of nodes from which node-lifecycle-controller treats the cluster as large for the eviction logic purposes. `--secondary-node-eviction-rate` is overridden to 0 for clusters this size or smaller. 
+|50 +|100,000 +|No + +|node-lifecycle-controller +|unhealthy-zone-threshold +|Percentage of nodes in a zone which need to be Not Ready for a zone to be treated as unhealthy. +|55% +|55% +|No + +|kubelet +|node-status-update-frequency +|How often kubelet posts node status to control plane. Be cautious when changing the constant, it must work with `nodeMonitorGracePeriod` in node-lifecycle-controller. +|10 +|10 +|Yes + +|kubelet +|node-labels +|Labels to add when registering the node in the cluster. The label `topology.kubernetes.io/zone` can be specified with hybrid nodes to group nodes into zones. +|None +|None +|Yes +|=== + +== Kubernetes pod failover through network disconnections + +The behavior described in this section assumes pods are running as Kubernetes Deployments with the default settings. The behavior described in this section assumes EKS is used as the Kubernetes provider. Please note, the actual behavior may vary based on the environment, nature of the network disconnection, applications, dependencies, and cluster configuration. The behavior described in this guide was validated using a specific application, cluster configuration, and subset of plugins. It is strongly recommended to test the behavior with your own applications and environments before deploying to production. + +When there are network disconnections between nodes and the Kubernetes control plane, the kubelet running the nodes cannot communicate with the Kubernetes control plane. Because of this, the kubelet cannot take action to evict pods on the nodes until the connection with the Kubernetes control plane is restored. This means that pods running on nodes before a network disconnection continue running during the disconnection, assuming there are no other failures that cause the pods to shutdown during the disconnection. In summary, you can achieve static stability during network disconnections between nodes and the Kubernetes control plane but you cannot perform mutating operations on your nodes or workloads during network disconnections. + +There are four scenarios that result in different pod failover behaviors based on the nature of the network disconnection. For all scenarios, we observed that the overall health of the cluster is automatically restored without operator intervention once the nodes reconnect to the Kubernetes control plane. The scenarios below contain expected results based on the observations of our testing but these expected results may not apply to all possible application and cluster configurations. + +=== Scenario 1: Full disruption + +*Expected result*: Pods running on unreachable nodes are not evicted and continue running on the same nodes. + +Full disruption means all nodes in the cluster are disconnected from the Kubernetes control plane. In this scenario, the node-lifecycle-controller running on the Kubernetes control plane detects that all nodes in the cluster are unreachable and cancels pod evictions. + +Cluster administrators will see all nodes with status `Unknown` during the disconnection. The pod status will not change and new pods will not be scheduled on other nodes during the disconnection and through reconnection. + +=== Scenario 2: Majority zone disruption + +*Expected Result*: Pods running on unreachable nodes are not evicted and continue running on the same nodes. + +Majority zone disruption means the majority of nodes in a zone are disconnected from the Kubernetes control plane. 
Zones in Kubernetes are defined by nodes with the same `topology.kubernetes.io/zone` label. If no zones are defined in the cluster then a majority means the majority of nodes in the cluster are disconnected. Majority is defined by the `unhealthy-zone-threshold` setting of the node-lifecycle-controller and is set to 55% by default in Kubernetes and in EKS. In this scenario, the node-lifecycle-controller considers the number of unreachable nodes as well as the number of nodes in the entire cluster to determine whether to evict pods from nodes. In the case of EKS, because `large-cluster-size-threshold` is set to 100,000, if 55% or more of nodes in a zone are unreachable, all pod evictions are cancelled because most clusters are smaller than 100,000 nodes. + +Cluster administrators will see a majority of nodes in the zone with status `Not Ready` during the disconnection, but the pod status will not change and pods will not be rescheduled on other nodes during the disconnection and through reconnection. + +Note, the behavior described above only applies for clusters larger than 3 nodes. In the case of clusters equal to or smaller than 3 nodes, pods on unreachable nodes will be scheduled for eviction and new pods will be scheduled on healthy nodes. + +Note, during testing iterations, it was observed that occasionally pods are evicted from 1, and only 1, of the unreachable nodes during network disconnections when a majority of nodes in a zone are unreachable. We are continuing to track down our suspicions of a race condition in the Kubernetes node-lifecycle-controller leading to this behavior. + +=== Scenario 3: Minority disruption + +*Expected Result*: Pods are evicted from unreachable nodes and new pods are scheduled on available, eligible nodes. + +Minority disruption means the minority of nodes in a zone are disconnected from the Kubernetes control plane. If no zones are defined in the cluster then a minority disruption means the minority of nodes in the cluster are disconnected. Minority is defined by the `unhealthy-zone-threshold` setting of the node-lifecycle-controller and is set to 55% by default in Kubernetes and in EKS. In this scenario, if the network disconnection lasts longer than 5 minutes (`default-unreachable-toleration-seconds`) and 40 seconds (`node-monitor-grace-period`), and less than 55% of nodes in a zone are unreachable, new pods will be scheduled on healthy nodes and the pods on the unreachable nodes in the zone will be scheduled for eviction. + +Cluster administrators will see new pods created on healthy nodes and the pods on the disconnected nodes will have status `Terminating`. As a reminder, even though the pods on the disconnected nodes have status `Terminating`, they will not be evicted from the node until the node reconnects to the Kubernetes control plane. + +=== Scenario 4: Node restart during network disruption + +*Expected Result*: Pods running on unreachable nodes are not started until the unreachable nodes reconnect to the Kubernetes control plane. The pod failover behavior is subject to the nature of the network disconnection as outlined in Scenarios 1-3. + +Node restart during network disruption means that there were multiple simultaneous failures, a network disconnection and then another event that caused the kubelet to restart such as a power cycle, out-of-memory error, or similar issues. 
In this scenario, the pods that were running on the node when the network disconnection happened will not be automatically restarted throughout the network disconnection if the kubelet was restarted. The reason for this is the kubelet contacts the Kubernetes API server during startup to learn which pods it should run. When the kubelet cannot contact the Kubernetes API server, as is the case during a network disconnection, it cannot get the information it needs to start the pods. + +In this scenario, local troubleshooting tools such as crictl cannot be used to manually start pods in a break-glass effort because Kubernetes follows a pattern of removing the failed pods and creating new pods rather than restarting existing pods (see https://github.com/containerd/containerd/pull/10213[#10213] in the containerd GitHub repo for more information). Static pods are the only Kubernetes workload object that are controlled by the kubelet and can be restarted during these scenarios. It is generally not recommended to use static pods for application deployments, and instead you should deploy multiple replicas across different hosts to maintain application availability in the event of an additional simultaneous failures during a network disconnection between your nodes and the Kubernetes control plane. \ No newline at end of file diff --git a/latest/bpg/images/hybrid/k8s-components-pod-failover.png b/latest/bpg/images/hybrid/k8s-components-pod-failover.png new file mode 100644 index 000000000..473d41b46 Binary files /dev/null and b/latest/bpg/images/hybrid/k8s-components-pod-failover.png differ diff --git a/latest/bpg/index.adoc b/latest/bpg/index.adoc index 54219e3eb..5176b7ed8 100644 --- a/latest/bpg/index.adoc +++ b/latest/bpg/index.adoc @@ -38,6 +38,7 @@ We currently have published guides for the following topics: * xref:cluster-upgrades[Best Practices for Cluster Upgrades] * xref:cost-opt[Best Practices for Cost Optimization] * xref:windows[Best Practices for Running Windows Containers] +* xref:hybrid[Best Practices for Hybrid Deployments] We also open sourced a Python based CLI (Command Line Interface) called https://github.com/aws-samples/hardeneks[hardeneks] to check some of the