How do I troubleshoot common issues with Amazon EKS node group update failures?


I want to use the newest Amazon Machine Image (AMI) versions to update my Amazon Elastic Kubernetes Service (Amazon EKS) node groups.

Short description

When you initiate a managed node group update, Amazon EKS automatically updates your nodes. If you use an Amazon EKS-optimized AMI, then Amazon EKS automatically applies the latest security patches and operating system updates to your nodes.

During this update process, you might see any of the following errors. Follow the relevant troubleshooting steps for the error that you encounter. For more information on update errors, see Understand each phase of node updates.

Resolution

The update failed because of PodEvictionFailure

The following error indicates that a PodEvictionFailure blocked the upgrade:

"Error message : Reached max retries while trying to evict pods from nodes in node group."

If the pods don't leave the node within 15 minutes and there's no force flag, then the upgrade phase fails with a PodEvictionFailure error.
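
To see which pods are still running on a node that fails to drain, list the pods that are scheduled on that node. In the following sketch, the node name is a placeholder for one of your cluster's nodes:

$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-192-168-xx-xx.ec2.internal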

A PodEvictionFailure error might occur for any of the following reasons:

Aggressive PDB (Pod Disruption Budget)

When multiple PDBs point to the same pod, the pod is considered to have an aggressive PDB.

A PDB indicates the number of disruptions that a class of pods can tolerate at a given time (a budget of faults). When a voluntary disruption would cause the pods for a service to drop below the budget, the operation pauses until it can maintain the budget. The node drain temporarily halts until more pods become available, so that you don't evict pods and exceed the budget. For more information, see Disruptions on the Kubernetes website.
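
For example, the following is a hypothetical PDB manifest (the name, namespace, and label are placeholders) that allows at most one pod with the label app: backend to be unavailable at a time. If you set maxUnavailable to 0, or set minAvailable to the full replica count, then node drains can stall and cause a PodEvictionFailure:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
  namespace: default
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: backend
EOF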

To allow a smooth managed node group update, either temporarily remove the pod disruption budgets or use the force option for the update. The force option doesn't respect pod disruption budgets. Instead, it forces the nodes to restart and apply the updates.

Note: If the application is quorum-based, then the force option might cause the application to become temporarily unavailable.
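
The following is a minimal sketch of a forced update. The cluster name test and node group name ng-backend are placeholders:

# With eksctl
$ eksctl upgrade nodegroup --cluster=test --name=ng-backend --force-upgrade

# Or with the AWS CLI
$ aws eks update-nodegroup-version --cluster-name test --nodegroup-name ng-backend --force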

To confirm that you have PDBs configured in your cluster, run the following command:

$ kubectl get pdb --all-namespaces
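
The output resembles the following example (the names and values are placeholders). An ALLOWED DISRUPTIONS value of 0 means that the PDB currently doesn't allow any evictions and can block the node drain:

NAMESPACE   NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
default     backend-pdb   2               N/A               0                     3d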

Or, if you turned on audit logging in the Amazon EKS console, then complete the following steps:

  1. Under the Clusters tab, choose the desired cluster (for example, rpam-eks-qa2-control-plane) from the list.

  2. Under the Logging tab, choose Audit. This action redirects you to the Amazon CloudWatch console.

  3. In the CloudWatch console, choose Logs. Then, choose Logs Insights and run the following query:

    fields @timestamp, @message
    | filter @logStream like "kube-apiserver-audit"
    | filter ispresent(requestURI)
    | filter objectRef.subresource = "eviction"
    | display @logStream, requestURI, responseObject.message
    | stats count(*) as retry by requestURI, responseObject.message
  4. Choose Custom to set the date range that covers the update. If the node group update failed because of an aggressive PDB, then responseObject.message describes the reason for the pod eviction failure.

  5. If a PDB caused the failure, then modify the PDB. Run the following command (or use the non-interactive patch that's shown after these steps), and then upgrade the node group again:

    $ kubectl edit pdb pdb_name
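
To change the PDB without an interactive editor, you can instead patch it. The following is a minimal sketch with placeholder names and values; adjust the namespace and the maxUnavailable value to fit your application's availability requirements:

$ kubectl patch pdb backend-pdb -n default --type merge -p '{"spec":{"maxUnavailable":1}}'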

Deployment tolerating all the taints

After all pods are evicted, the node is expected to be empty because it was tainted in the earlier steps. However, if a deployment tolerates every taint, then its pods can still be scheduled onto the tainted node, and the node remains non-empty. This results in a pod eviction failure. For more information, see Taints and tolerations on the Kubernetes website.
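
A toleration with operator: Exists and no key matches every taint. The following sketch lists deployments whose pod templates include such a toleration. It assumes that the jq tool is installed in your environment:

$ kubectl get deployments --all-namespaces -o json \
    | jq -r '.items[]
        | select([.spec.template.spec.tolerations[]?
                  | (.operator == "Exists" and (has("key") | not))] | any)
        | "\(.metadata.namespace)/\(.metadata.name)"'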

The update failed because of a non-valid release version

If you have a release version that's not valid, then you might receive the following error:

"Error Message: Requested release version 1.xx is not valid for kubernetes version 1.xx."

To resolve this issue, run the upgrade command again. This command upgrades the node group to the same Kubernetes version as the control plane:

eksctl upgrade nodegroup --name=test --cluster=test --kubernetes-version=1.xx

Note: Replace 1.xx with a version that the Amazon EKS control plane supports.
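
To check which Kubernetes version the control plane runs before you choose the node group version, you can describe the cluster. The cluster name test is a placeholder:

$ aws eks describe-cluster --name test --query cluster.version --output text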

The update failed because the node group has health issues

If your node group has health issues, then a failed update returns the following error:

"Error message: Nodegroup has health issue other than [ AsgInstanceLaunchFailures, InstanceLimitExceeded, InsufficientFreeAddresses, ClusterUnreachable]"

This error indicates that the node group's Auto Scaling group can't find the expected version of its Amazon Elastic Compute Cloud (Amazon EC2) launch template. The error occurs if you manually modify or delete the launch template version that's associated with the node group. The mismatch causes Amazon EKS to show the node group as degraded.

If you didn't delete the launch template, then manually change the launch template version of the Auto Scaling group back to the appropriate version. This action reverts the node group to a healthy and active state, and you can reinitiate the update process.
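
To compare the launch template version that the node group expects with the version that the Auto Scaling group uses, you can run commands similar to the following sketch. The cluster name, node group name, and Auto Scaling group name are placeholders:

# Health status, expected launch template, and associated Auto Scaling group of the node group
$ aws eks describe-nodegroup --cluster-name test --nodegroup-name ng-backend \
    --query 'nodegroup.{health:health,launchTemplate:launchTemplate,asg:resources.autoScalingGroups}'

# Launch template configuration that the Auto Scaling group currently uses
$ aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names eks-ng-backend-xxxxxxxx \
    --query 'AutoScalingGroups[].{LaunchTemplate:LaunchTemplate,MixedInstancesPolicy:MixedInstancesPolicy}'

If the versions don't match, then use the aws autoscaling update-auto-scaling-group command or the Amazon EC2 Auto Scaling console to set the launch template version back to the version that the node group expects.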

The update failed because new nodes aren't joining the node group

This issue occurs if the node group's new nodes can't join the cluster. As a result, the node group rolls back to its previous version. In this case, you might see the following error:

"NodeCreationFailure

Couldn't proceed with upgrade process as new nodes are not joining node group ng-backend"

There are multiple reasons why updated nodes can't join the cluster. To troubleshoot this issue, see How can I get my worker nodes to join my Amazon EKS cluster?
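
To see the detailed error for the failed update and to check which nodes joined the cluster, you can run commands similar to the following sketch. The cluster name, node group name, and update ID are placeholders:

# List recent updates for the node group, and then inspect the failed update
$ aws eks list-updates --name test --nodegroup-name ng-backend
$ aws eks describe-update --name test --nodegroup-name ng-backend --update-id 12345678-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Check which nodes joined the cluster and their status
$ kubectl get nodes -o wide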
