The issue you're experiencing with EKS nodes going into "Unknown" status after 20 hours of running is concerning and could be caused by several factors. Let's explore some potential causes and solutions:
- Communication issues: The "Unknown" status typically indicates that the Kubernetes control plane has lost communication with the node. This could be due to network problems, security group configurations, or VPC settings preventing proper communication.
- Kubelet issues: The error message you're seeing in the kubelet logs suggests there might be a problem with the node's ability to update its lease. This could be related to API server connectivity or permissions issues.
- IAM roles and permissions: While you mentioned that IAM policies seem correct, it's worth double-checking that the node instance role (MyEKSNodeInstanceRole) has all the necessary permissions to communicate with the EKS control plane and other AWS services.
- Resource constraints: Although less likely given the instance type (m5.xlarge), check if there are any resource constraints (CPU, memory, or disk space) that might be affecting the node's ability to function properly after extended periods of operation.
- Add-on compatibility: The fact that upgrading add-ons can trigger the "Unknown" status suggests there might be compatibility issues between your cluster version, node version, and add-on versions.
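To narrow down which of these causes applies, it can help to look at each node's Ready condition and at the last time its kubelet renewed its lease. The following is only a sketch: the function names (`node_ready_status`, `filter_unhealthy`, `lease_renew_time`) are made up for illustration, and the kubectl calls assume your kubeconfig points at the affected cluster.

```shell
# Sketch only: helper names are hypothetical, not part of any AWS or Kubernetes tooling.

# Prints "<node><TAB><Ready condition status>" for every node (True, False, or Unknown).
# Assumes kubectl is configured for the affected cluster.
node_ready_status() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}

# Reads "<node><TAB><status>" lines on stdin and prints the nodes whose
# Ready condition is anything other than "True".
filter_unhealthy() {
  awk -F'\t' 'NF >= 2 && $2 != "True" { print $1 }'
}

# Prints the timestamp of the last lease renewal for the node named in $1.
# Node leases live in the kube-node-lease namespace and are named after the node.
lease_renew_time() {
  kubectl get lease "$1" -n kube-node-lease -o jsonpath='{.spec.renewTime}'
}

# Example usage against a live cluster:
#   node_ready_status | filter_unhealthy
#   lease_renew_time <node-name>
```

If the renew time of an affected node stops advancing while the instance itself is still running, that points toward the kubelet being unable to reach or authenticate against the API server rather than toward a crashed instance.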
To troubleshoot and potentially resolve the issue:
- Review the kubelet logs more thoroughly to identify any recurring errors or patterns leading up to the "Unknown" status.
- Verify network connectivity between the nodes and the EKS control plane, ensuring all necessary ports are open and security groups are correctly configured.
- Check the AWS VPC CNI plugin functionality, as networking issues can cause node communication problems.
- Consider enabling the Node Problem Detector, which can help identify and report node-level issues more effectively.
- Ensure that your EKS cluster, node group, and add-on versions are all compatible and up-to-date.
- If the issue persists, you may need to consider creating a new node group with updated configurations and gradually migrating workloads to the new nodes.
- As a temporary workaround, you could set up an automated process to recycle nodes before the 20-hour mark to prevent them from entering the "Unknown" state.
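For the recycling workaround, one possible starting point is a helper that picks out nodes older than a chosen age. This is a sketch under assumptions, not a finished automation: `nodes_older_than` is a hypothetical name, the 18-hour threshold in the usage comment is an arbitrary margin below the 20-hour mark, and the timestamps are assumed to be in the UTC `YYYY-MM-DDTHH:MM:SSZ` form that Kubernetes uses for `creationTimestamp`. The date arithmetic is done in awk so the helper does not depend on GNU `date`.

```shell
# Sketch only: nodes_older_than is a hypothetical helper name.
# Reads "<node> <creationTimestamp>" lines on stdin.
#   $1 = maximum age in hours
#   $2 = current time in epoch seconds
# Prints the names of nodes older than the maximum age.
nodes_older_than() {
  awk -v max_hours="$1" -v now="$2" '
    # Days since 1970-01-01 for a Gregorian calendar date (years from 1970 on assumed).
    function days_from_civil(y, m, d,    era, yoe, doy, doe) {
      if (m <= 2) y -= 1
      era = int(y / 400)
      yoe = y - era * 400
      doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
      doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
      return era * 146097 + doe - 719468
    }
    NF >= 2 {
      # Split 2024-01-01T00:00:00Z into year, month, day, hour, minute, second.
      split($2, t, /[-T:Z]/)
      days = days_from_civil(t[1] + 0, t[2] + 0, t[3] + 0)
      created = days * 86400 + t[4] * 3600 + t[5] * 60 + t[6]
      if (now - created > max_hours * 3600) print $1
    }'
}

# Example usage against a live cluster:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' \
#     | nodes_older_than 18 "$(date +%s)"
```

Each node it prints could then be cordoned and drained (`kubectl cordon <node>`, `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`) before its instance is replaced. How the replacement happens depends on how the node group is managed, so that part is left out here.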
If these steps don't resolve the issue, it may be necessary to engage AWS support for more in-depth diagnostics, especially since the problem doesn't occur in EKS Auto Mode clusters. They can provide more detailed insights into the underlying causes and potential solutions specific to your cluster configuration.
Sources
Node in EKS Cluster has status Unknown | AWS re:Post
EKS node creation failed | AWS re:Post
Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters | AWS Machine Learning Blog
Most likely IAM. Check which policies are attached to the node role:
aws iam list-attached-role-policies --role-name <node-role-name>
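To make that check repeatable, the output can be compared against the set of policies you expect. The sketch below rests on assumptions: `missing_policies` is a hypothetical helper, and the expected names in the usage comment (`AmazonEKSWorkerNodePolicy`, `AmazonEC2ContainerRegistryReadOnly`, `AmazonEKS_CNI_Policy`) are the managed policies commonly attached to a node role for a standard node group. Please verify them against the current EKS node IAM role documentation for your setup. The CNI policy in particular may be attached to the `aws-node` service account instead of the node role.

```shell
# Sketch only: missing_policies is a hypothetical helper name.
# Reads attached policy names on stdin (tab-, space-, or newline-separated)
# and prints every expected policy (given as arguments) that is not attached.
missing_policies() {
  attached="$(cat)"
  for want in "$@"; do
    printf '%s\n' "$attached" | tr '\t ' '\n\n' | grep -qxF "$want" \
      || printf '%s\n' "$want"
  done
}

# Example usage (the expected policy names are an assumption, see above):
#   aws iam list-attached-role-policies --role-name <node-role-name> \
#       --query 'AttachedPolicies[].PolicyName' --output text \
#     | missing_policies AmazonEKSWorkerNodePolicy \
#         AmazonEC2ContainerRegistryReadOnly AmazonEKS_CNI_Policy
```

The match is on the whole name, so a role that carries only `AmazonEKSWorkerNodeMinimalPolicy` is still reported as missing `AmazonEKSWorkerNodePolicy`. That is worth checking if the role was originally created for an EKS Auto Mode cluster and then reused for a regular node group.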

That's well and good, but what should it have? My instance role has all the policies that are created with the EKS Auto Mode cluster.