How do I troubleshoot common issues with Amazon EKS node group update failures?


I want to update my Amazon Elastic Kubernetes Service (Amazon EKS) node groups using the newest Amazon Machine Image (AMI) versions.

Short description

Newer releases of Amazon EKS include new Amazon EKS optimized Amazon Machine Image (AMI) versions for node group updates. Customers who deploy workloads across multiple node groups face the challenge of keeping their nodes up to date with each release cycle.

When you initiate a managed node group update, Amazon EKS automatically updates your nodes for you. If you're using an Amazon EKS optimized AMI, then Amazon EKS automatically applies the latest security patches and operating system updates to your nodes as part of the latest AMI release version. To implement the update, Auto Scaling launches new nodes in every Availability Zone where the node group has nodes, and might rebalance the Availability Zones. The existing nodes are drained only after the launch step succeeds. The scale-down phase then decrements the Auto Scaling group's maximum size and desired size by one to return them to their pre-update values. See "Scale down phase" in Managed node update behavior for more information.
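As a sketch of how to initiate such an update with the AWS CLI, assuming a cluster named my-cluster and a managed node group named my-nodegroup (both placeholder names):

```shell
# Start a managed node group update to the latest AMI release version
# for the node group's current Kubernetes version.
# "my-cluster" and "my-nodegroup" are placeholder names.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup

# List in-progress and recent updates for the node group so that you
# can track the update's status.
aws eks list-updates \
  --name my-cluster \
  --nodegroup-name my-nodegroup
```

You can pass the returned update ID to aws eks describe-update to see the update's phase and any error details.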


During this update process, you might see some of the following errors, each of which requires its own mitigation steps. Being aware of these issues in advance helps you minimize downtime. See Managed node update behavior for more information on update errors.

Update failed due to PodEvictionFailure

Error message: Reached max retries while trying to evict pods from nodes in node group.

This error indicates that the upgrade is blocked by PodEvictionFailure. If the pods don't leave the node within 15 minutes and there's no force flag, then the upgrade phase fails with a PodEvictionFailure error.

The following are reasons for PodEvictionFailure error in the upgrade phase:

Aggressive PDB (Pod Disruption Budget)

An aggressive PDB exists when multiple PDBs point to the same pod, leaving little or no room for voluntary disruptions.

PDB indicates the number of disruptions that can be tolerated at a given time for a class of pods (a budget of faults). Whenever a voluntary disruption causes the pods for a service to drop below the budget, the operation pauses until it can maintain the budget. The node drain event halts temporarily until more pods become available so that the budget isn’t crossed by evicting the pods. For more information, see Disruptions on the Kubernetes website.

To make sure that the managed node group update succeeds, you must either temporarily remove the pod disruption budgets or use the force option to update. The force option doesn't respect pod disruption budgets. Instead, this option implements the updates by forcing the nodes to restart.

Note: If the application is quorum-based, then the force option can cause the application to become temporarily unavailable.
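If you decide to force the update, a minimal sketch with the AWS CLI looks like the following (cluster and node group names are placeholders):

```shell
# Force the node group update: pods are evicted even if that violates
# a pod disruption budget, and the nodes are replaced regardless.
# "my-cluster" and "my-nodegroup" are placeholder names.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --force
```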

Run the following command to confirm that you have PDBs configured in your cluster:

$ kubectl get pdb --all-namespaces

Or, if you turned on audit logging for your cluster, then follow these steps in the Amazon EKS console:

1.    Under the Clusters tab, choose the desired cluster (for example, rpam-eks-qa2-control-plane) from the list.

2.    Under the Logging tab, choose Audit. This redirects you to the Amazon CloudWatch console.

3.    In the CloudWatch console, choose Logs. Then, choose Log Insights and run the following query:

fields @timestamp, @message
| filter @logStream like "kube-apiserver-audit"
| filter ispresent(requestURI)
| filter objectRef.subresource = "eviction" 
| display @logStream, requestURI, responseObject.message
| stats count(*) as retry by requestURI, responseObject.message

4.    Choose Custom in the upper-right corner to select the date range of the update. If the node group update failed because of an aggressive PDB, then responseObject.message describes the reason for the pod eviction failure.

5.    If a PDB caused the failure, then modify the PDB with the following command, replacing pdb_name with the name of your PDB. Then, upgrade the node group again:

$ kubectl edit pdb pdb_name
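Instead of an interactive edit, you can also relax the budget non-interactively. This sketch assumes a PDB named my-pdb in the namespace my-namespace (both placeholder names) and loosens it to allow one unavailable pod at a time:

```shell
# Inspect the PDB's current spec and allowed disruptions.
kubectl get pdb my-pdb -n my-namespace -o yaml

# Relax the budget so that node drains can evict one pod at a time.
# "my-pdb" and "my-namespace" are placeholder names.
kubectl patch pdb my-pdb -n my-namespace \
  --type merge -p '{"spec":{"maxUnavailable":1}}'
```

After the update completes, restore the PDB to its original values.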

Deployment tolerating all the taints

After every pod is evicted, the node becomes empty because the node was tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to a pod eviction failure. See Taints and tolerations on the Kubernetes website for more information.
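A toleration with operator Exists and no key matches every taint. As a rough check, assuming that jq is installed, you can list deployments whose pod templates carry such a catch-all toleration:

```shell
# List deployments whose pod template tolerates every taint
# (a toleration with operator "Exists" and no key).
kubectl get deployments --all-namespaces -o json | jq -r '
  .items[]
  | select(any((.spec.template.spec.tolerations // [])[];
               .operator == "Exists" and (has("key") | not)))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```

Pods from these deployments can be rescheduled onto the tainted nodes during the update, which keeps the nodes non-empty and can block the drain.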

Update failed due to a release version that isn't valid

Error message: Requested release version 1.xx is not valid for kubernetes version 1.xx.

To resolve this issue, run the upgrade command again. This command upgrades the node groups to the same version as the control plane's Kubernetes version:

eksctl upgrade nodegroup --name=test --cluster=test --kubernetes-version=1.xx

Note: Replace 1.xx with a version that the Amazon EKS control plane supports.
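To find the version to pass, you can query the control plane first. This sketch assumes a cluster and node group both named test, matching the eksctl example above:

```shell
# Look up the control plane's Kubernetes version.
VERSION=$(aws eks describe-cluster --name test \
  --query 'cluster.version' --output text)

# Upgrade the node group to the same version as the control plane.
eksctl upgrade nodegroup --name=test --cluster=test \
  --kubernetes-version="$VERSION"
```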

Update failed as node group has health issues

Error message: Nodegroup has health issue other than [ AsgInstanceLaunchFailures, InstanceLimitExceeded, InsufficientFreeAddresses, ClusterUnreachable]

This failure occurs if you manually modified the node group's Auto Scaling group to use a different version of its Amazon Elastic Compute Cloud (Amazon EC2) launch template, or if you deleted the launch template version that's associated with the node group. The mismatch between the launch template version that the node group expects and the one that the Auto Scaling group uses causes Amazon EKS to show the node group as degraded.

If you haven't deleted the launch template version, then manually change the Auto Scaling group's launch template version back to the appropriate version. This action returns the node group to a healthy and active state, and you can then reinitiate the update process.
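A hedged sketch of that reversion with the AWS CLI, where the Auto Scaling group name, launch template ID, and version number are all placeholder values that you must look up for your node group:

```shell
# Find the launch template ID and version that the node group expects.
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.launchTemplate'

# Point the Auto Scaling group back at that launch template version.
# "my-asg", "lt-0abc1234567890def", and "3" are placeholder values.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-template 'LaunchTemplateId=lt-0abc1234567890def,Version=3'
```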
