Why is my Amazon EKS node group in the Degraded status?

The managed node group in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster is in the Degraded status.

Resolution

Take the following troubleshooting actions based on the error message that you receive in NodegroupHealth.
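If you retrieve the node group description programmatically, the health issues can be pulled out of the response before you pick a section below. The following is a minimal sketch, assuming a response shaped like the output of the Amazon EKS DescribeNodegroup API (nodegroup, then health, then issues). The helper name list_health_issues and the sample data are illustrative only.

```python
from typing import Any


def list_health_issues(response: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (code, message) pairs from a DescribeNodegroup-style response."""
    health = response.get("nodegroup", {}).get("health", {})
    return [
        (issue.get("code", ""), issue.get("message", ""))
        for issue in health.get("issues", [])
    ]


# Hand-written sample for illustration; not captured from a real cluster.
sample_response = {
    "nodegroup": {
        "status": "DEGRADED",
        "health": {
            "issues": [
                {
                    "code": "AccessDenied",
                    "message": "Your worker nodes do not have access to the cluster.",
                    "resourceIds": [],
                }
            ]
        },
    }
}

for code, message in list_health_issues(sample_response):
    print(f"{code}: {message}")
```

The issue code that the helper returns corresponds to the section headings in this article.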

AccessDenied

The AccessDenied error occurs when Amazon EKS or one of your managed nodes can't authenticate with, or isn't authorized by, your cluster's Kubernetes API server. You receive an error similar to the following example:

"Your worker nodes do not have access to the cluster. Verify if the node instance role is present and correctly configured in the aws-auth ConfigMap."

To troubleshoot this issue, run the following command to confirm that the node instance role is correctly mapped in the aws-auth ConfigMap:

kubectl get configmap aws-auth -n kube-system -o yaml

Or, confirm that you correctly mapped the node role to an access entry.

Your worker node instance role must be present and correctly configured. Make sure to map the node role only to the system:bootstrappers and system:nodes groups. It's not a best practice to map the node role to the system:masters group.
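The group check described above can be automated after you parse the mapRoles section of the ConfigMap into a list of entries. The following is a rough sketch, assuming each entry is a dictionary with rolearn and groups keys, as in the aws-auth ConfigMap. The function name, the placeholder account ID, and the role name are hypothetical.

```python
ALLOWED_NODE_GROUPS = {"system:bootstrappers", "system:nodes"}


def check_node_role_mapping(map_roles: list[dict], node_role_arn: str) -> list[str]:
    """Return a list of problems found for the node role in parsed mapRoles entries."""
    entries = [entry for entry in map_roles if entry.get("rolearn") == node_role_arn]
    if not entries:
        return [f"{node_role_arn} is not mapped in aws-auth"]

    problems = []
    for entry in entries:
        # Any group outside the allowed set (for example, system:masters) is flagged.
        unexpected = set(entry.get("groups", [])) - ALLOWED_NODE_GROUPS
        if unexpected:
            problems.append(f"unexpected groups: {sorted(unexpected)}")
    return problems


# Placeholder ARN and entries for illustration only.
node_role = "arn:aws:iam::111122223333:role/example-node-role"
parsed_map_roles = [
    {
        "rolearn": node_role,
        "username": "system:node:{{EC2PrivateDNSName}}",
        "groups": ["system:bootstrappers", "system:nodes", "system:masters"],
    }
]

print(check_node_role_mapping(parsed_map_roles, node_role))
```

An empty list means that the sketch found no problems. It doesn't validate access entries or the username field.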

You also receive AccessDenied errors when the role that's performing operations on the managed node groups doesn't have eks:node-manager ClusterRole or ClusterRoleBinding permissions. To troubleshoot this issue, update your role permissions.

If you use a private Windows Amazon Machine Image (AMI) to launch a managed node group, then you might receive the Not authorized for images error message. When AWS releases a new Windows AMI, AWS makes all Windows AMIs that are older than 4 months private. To troubleshoot this issue, see Not authorized for images.

AmiIdNotFound

The AmiIdNotFound error occurs when Amazon EKS can't find the AMI ID associated with your launch template. You receive an error similar to the following example:

"AmiIdNotFound: The ami id '[ami-0cxx]' does not exist"

To troubleshoot this issue, make sure that the AMI ID that you added to your custom launch template exists. Also, make sure that the AMI is shared with your AWS account.

AutoScalingGroupNotFound

The AutoScalingGroupNotFound error occurs when Amazon EKS can't find the Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling group associated with the managed node group. You receive an error similar to the following example:

"AutoScalingGroupNotFound - The Amazon AutoScalingGroup ASG Name was not found."

To troubleshoot this issue, make sure that you didn't delete the EC2 Auto Scaling group associated with the managed node group. If you accidentally deleted the EC2 Auto Scaling group, then create an EC2 Auto Scaling group with the same name. Wait a few minutes, and then check whether the node group is back in the Active status.
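The wait-and-check step can be scripted as a simple polling loop. The following is a sketch, assuming that you supply a get_status callable that returns the node group's current status string (for example, a wrapper around a DescribeNodegroup call that reads nodegroup.status). The function name, the retry count, and the delay are arbitrary illustrative choices.

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    attempts: int = 20,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until it returns "ACTIVE" or the attempts run out."""
    for attempt in range(attempts):
        if get_status() == "ACTIVE":
            return True
        # Don't sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The sleep parameter is injectable so that you can exercise the loop without waiting. If the function returns False, then the node group is still not in the Active status and needs further investigation.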

ClusterUnreachable

The ClusterUnreachable error occurs when Amazon EKS or your managed nodes can't communicate with your Kubernetes cluster API server. You receive an error similar to the following example:

"Ec2SecurityGroupNotFound You must use a valid fully-formed launch template. The security group 'sg-09fxx' does not exist in VPC 'vpc-0a8cxx'"

This error typically occurs because of network disruptions or requests to the API server that time out. This error also occurs if you exceed the 8 GB quota for your etcd database size. To troubleshoot this issue, see Managing etcd database size on Amazon EKS clusters. Also, make sure that the core add-ons, such as kube-proxy, Amazon Virtual Private Cloud (Amazon VPC) CNI, and CoreDNS, are updated to the latest version.

AutoScalingGroupInvalidConfiguration

The AutoScalingGroupInvalidConfiguration error occurs when you incorrectly configure the managed node group's EC2 Auto Scaling group. You receive an error similar to the following example:

"AutoScalingGroupInvalidConfiguration: The Amazon AutoScalingGroup ASG Name has subnets ([Incorrect Subnet ID 1, Incorrect Subnet ID 2, Incorrect Subnet ID 3]) which is not expected by Amazon EKS. Expected subnets : ([Correct Subnet ID 1, Correct Subnet ID 2, Correct Subnet ID 3])."

To troubleshoot this issue, identify and remove changes to the EC2 Auto Scaling group. Make sure that the associated subnets haven't changed. Update the EC2 Auto Scaling group associated with your node group to use the subnets listed in the error message.
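To see exactly which subnets drifted, you can compare the subnets currently on the Auto Scaling group with the expected subnets from the error message. The following is a minimal sketch. The function name and the subnet IDs are placeholders, not values from a real account.

```python
def subnet_drift(current: list[str], expected: list[str]) -> dict[str, list[str]]:
    """Compare the Auto Scaling group's subnets with the subnets Amazon EKS expects."""
    current_set, expected_set = set(current), set(expected)
    return {
        "to_remove": sorted(current_set - expected_set),
        "to_add": sorted(expected_set - current_set),
    }


# Placeholder subnet IDs for illustration only.
current_subnets = ["subnet-aaa", "subnet-bbb", "subnet-zzz"]
expected_subnets = ["subnet-aaa", "subnet-bbb", "subnet-ccc"]

print(subnet_drift(current_subnets, expected_subnets))
```

If both lists in the result are empty, then the subnets already match and the cause of the error is elsewhere in the Auto Scaling group configuration.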

It's not a best practice to manually update the EC2 Auto Scaling group that you associated with the managed node group. Only make a manual change to revert manual changes that you previously made.

Ec2SecurityGroupNotFound

The Ec2SecurityGroupNotFound error occurs when Amazon EKS can't find the cluster security group. You receive an error similar to the following example:

"Ec2SecurityGroupNotFound The Amazon EC2 Security Group sg-04f3xx for node group-Name was not found."

If you receive this error message, then you can no longer use the managed node group in the Degraded status. Instead, you must launch a new node group. Then, drain and delete the previous node group. For more information about how to drain a node group, see Safely drain a node on the Kubernetes website.

Ec2LaunchTemplateNotFound

The Ec2LaunchTemplateNotFound error occurs when Amazon EKS can't find the Amazon EC2 launch template for your managed node group. You receive an error similar to the following example:

"The Amazon EC2 Launch Template lt-0cdac3xxf version number was not found."

Amazon EKS deploys managed node groups with a managed launch template that's associated with the underlying managed EC2 Auto Scaling group.

If you accidentally deleted the launch template, then it's a best practice to launch a new node group. Then, drain and delete the previous node group. For more information about how to drain a node group, see Safely drain a node on the Kubernetes website.

It's not a best practice to manually update the EC2 Auto Scaling group that you associated with the managed node group. Only make a manual change to revert manual changes that you previously made.

Ec2LaunchTemplateVersionMismatch

The Ec2LaunchTemplateVersionMismatch error occurs when the managed node group's EC2 Auto Scaling group launch template version doesn't match the version that Amazon EKS created. You receive an error similar to the following example:

"The Amazon EC2 Launch Template : lt-0cdacxx has a new version (number) associated with your Autoscaling group, which is not managed by Amazon EKS. Expected Launch Template version : (number) lt-0cdac39f3axx"

Amazon EKS always deploys managed node groups with a managed launch template. If you don't provide a launch template, then Amazon EKS automatically creates one with your account's default values. It's not a best practice to modify the automatically generated template. You also can't directly update existing node groups that don't use a custom launch template. Instead, you must create a new node group with a custom launch template.

To resolve the Ec2LaunchTemplateVersionMismatch error for a custom launch template, update the launch template version to the expected launch template version noted in the error message.

For information about the allowed actions in a launch template for Amazon EKS node groups, see Launch template configuration basics.

AsgInstanceLaunchFailures

The AsgInstanceLaunchFailures error occurs when your EC2 Auto Scaling group can't launch instances. You receive an error similar to the following example:

"AsgInstanceLaunchFailures Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed."

This error typically occurs when there isn't enough Spot Instance capacity for your requested instance type. It's a best practice to use a mix of instance types and families from different Spot Instance pools to get capacity. For more information, see How do I launch and troubleshoot Spot Instances using Amazon EKS managed node groups?

InsufficientFreeAddresses

The InsufficientFreeAddresses error occurs when a subnet associated with your managed node group doesn't have enough available IP addresses for new nodes. You receive an error similar to the following example:

"InsufficientFreeAddresses - Amazon AutoScaling was unable to launch instances because there are not enough free addresses in the subnet associated with your AutoScaling group(s)."

To resolve this issue, make sure that you have enough IP addresses in the subnets associated with the managed node group. For more information, see Optimizing IP address utilization.
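To estimate whether a subnet can hold more nodes, it helps to know how many addresses its CIDR block provides and how many are still free. The following is a rough sketch that uses Python's standard ipaddress module. It relies on the fact that AWS reserves five IP addresses in every subnet. The function names, the CIDR block, and the needed_per_subnet threshold are illustrative assumptions. The real number of addresses that each node consumes depends on the instance type and your Amazon VPC CNI configuration.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Return the number of addresses in an IPv4 subnet that AWS doesn't reserve."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def subnets_short_on_addresses(
    available_by_subnet: dict[str, int], needed_per_subnet: int
) -> list[str]:
    """Return the subnet IDs whose free address count is below the threshold."""
    return sorted(
        subnet_id
        for subnet_id, available in available_by_subnet.items()
        if available < needed_per_subnet
    )


print(usable_addresses("10.0.1.0/28"))  # 16 - 5 = 11

# Placeholder counts, such as the available IP address count reported for each subnet.
print(subnets_short_on_addresses({"subnet-aaa": 3, "subnet-bbb": 120}, needed_per_subnet=30))
```

Subnets that the second function returns are candidates for freeing addresses or for replacement with larger subnets.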

IamInstanceProfileNotFound or IamNodeRoleNotFound

The IamInstanceProfileNotFound and IamNodeRoleNotFound errors occur when you delete the AWS Identity and Access Management (IAM) role or instance profile that you associated with the managed node group. To resolve this issue, create a new IAM role or instance profile with the same name and settings as the deleted role or instance profile.

AWS OFFICIAL, updated a month ago