
EKS Nodes go into Unknown Status after 20 hours

We are in the process of moving from Docker Swarm to EKS. We can deploy applications and do everything we need to do, but after the nodes have been running for 20 hours they go into an "Unknown" status. I've gone through this article, and all the IAM policies appear to be as they should be, based on every resource I could find on the IAM permissions needed to run EKS.

If I create an EKS Auto Mode cluster instead, the nodes never go into an "Unknown" status.

Here's a copy of my config:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: dev
  region: us-east-1

iamIdentityMappings:
  - arn: arn:aws:iam::0000000000:role/MyDevOps
    groups:
      - eks-console-dashboard-full-access-group
    noDuplicateARNs: true

iam:
  serviceRoleARN: arn:aws:iam::0000000000:role/myEksServiceRole
  vpcResourceControllerPolicy: true
  withOIDC: false
  podIdentityAssociations:
  - namespace: kube-system
    serviceAccountName: aws-node
    roleARN: arn:aws:iam::000000000:role/MyPodIdentityAccessRole
  - namespace: kube-system
    serviceAccountName: efs-csi-controller-sa
    roleARN: arn:aws:iam::0000000000:role/MyPodIdentityAccessRole

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

vpc:
  subnets:
    private:
      us-east-1c:
        id: "subnet-id-here"
        cidr: "10.1.8.0/24"

      us-east-1d:
        id: "subnet-id-here"
        cidr: "10.1.9.0/24"

      us-east-1e:
        id: "subnet-id-here"
        cidr: "10.1.10.0/24"

managedNodeGroups:
  - name: dev-worker
    instanceType: m5.xlarge
    desiredCapacity: 2
    disableIMDSv1: true
    privateNetworking: true 
    volumeSize: 100
    securityGroups:
      attachIDs: [ .... ]
    iam:
      instanceRoleARN: arn:aws:iam::00000000000:role/MyEKSNodeInstanceRole
    labels: {role: worker}
    tags:
      Name: dev-worker
      nodegroup-role: worker
      Environment: dev
      Application: mm

addons:
  - name: eks-pod-identity-agent
  - name: aws-efs-csi-driver

I also noticed, after running journalctl -u kubelet, that at the 20-hour mark the kubelet starts logging the following:

Failed to update lease

Another way to get the nodes into an "Unknown" status is to upgrade all the add-ons that need to be upgraded.

Has anyone else experienced this? Should the "Node Monitoring Agent" be installed as well?

Thanks for any help ANYONE can provide.

2 Answers

EKS nodes going into "Unknown" status after 20 hours of running can have several causes. Here are the most likely ones, followed by troubleshooting steps:

  1. Communication issues: The "Unknown" status typically indicates that the Kubernetes control plane has lost communication with the node. This could be due to network problems, security group configurations, or VPC settings preventing proper communication.

  2. Kubelet issues: The "Failed to update lease" message in the kubelet logs means the kubelet can no longer renew its node Lease, the heartbeat the control plane uses to decide whether a node is healthy. This usually points to API server connectivity problems or to a credentials or permissions issue.

  3. IAM roles and permissions: While you mentioned that IAM policies seem correct, it's worth double-checking that the node instance role (MyEKSNodeInstanceRole) has all the necessary permissions to communicate with the EKS control plane and other AWS services.

  4. Resource constraints: Although less likely given the instance type (m5.xlarge), check if there are any resource constraints (CPU, memory, or disk space) that might be affecting the node's ability to function properly after extended periods of operation.

  5. Add-on compatibility: The fact that upgrading add-ons can trigger the "Unknown" status suggests there might be compatibility issues between your cluster version, node version, and add-on versions.
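
To tie the first two causes to something observable, here is a minimal shell sketch rather than a definitive procedure. It assumes kubectl is already configured for the cluster; the helper name not_ready_nodes and the <node-name> placeholder are made up for illustration. Note that kubectl get nodes prints NotReady in the STATUS column when a node's Ready condition is Unknown, so the helper lists every node whose status does not start with Ready.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name and status of every node that is not Ready.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1, $2 }'
}

# Example usage (assumes kubectl access; <node-name> is a placeholder):
#   kubectl get nodes --no-headers | not_ready_nodes
#
# Read the Ready condition itself (True, False, or Unknown):
#   kubectl get node <node-name> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
#
# See when the kubelet last renewed its heartbeat Lease:
#   kubectl get lease -n kube-node-lease <node-name> \
#     -o jsonpath='{.spec.renewTime}'
```

If renewTime stops advancing at about the same moment the kubelet starts logging lease failures, the node itself is probably still running and the problem is between the kubelet and the API server.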

To troubleshoot and potentially resolve the issue:

  1. Review the kubelet logs more thoroughly to identify any recurring errors or patterns leading up to the "Unknown" status.

  2. Verify network connectivity between the nodes and the EKS control plane, ensuring all necessary ports are open and security groups are correctly configured.

  3. Check the AWS VPC CNI plugin functionality, as networking issues can cause node communication problems.

  4. Consider enabling the Node Problem Detector, which can help identify and report node-level issues more effectively.

  5. Ensure that your EKS cluster, node group, and add-on versions are all compatible and up-to-date.

  6. If the issue persists, you may need to consider creating a new node group with updated configurations and gradually migrating workloads to the new nodes.

  7. As a temporary workaround, you could set up an automated process to recycle nodes before the 20-hour mark to prevent them from entering the "Unknown" state.
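
For step 1, a small sketch like the following can show when the lease failures begin and what error accompanies them. It assumes shell access to the affected node; the function name lease_failure_summary is hypothetical, and the exact kubelet log layout may differ between versions, so treat the match string as an assumption taken from the message quoted in the question.

```shell
# Hypothetical helper: read kubelet journal output on stdin and summarize
# the "Failed to update lease" lines: how many, plus the first and the last.
lease_failure_summary() {
  # grep exits non-zero when nothing matches; "|| true" keeps that harmless.
  matches=$(grep -F "Failed to update lease" || true)
  if [ -z "$matches" ]; then
    echo "no lease failures found"
    return 0
  fi
  echo "count: $(printf '%s\n' "$matches" | wc -l | tr -d ' ')"
  echo "first: $(printf '%s\n' "$matches" | head -n 1)"
  echo "last: $(printf '%s\n' "$matches" | tail -n 1)"
}

# Example usage on the node (ISO timestamps make the 20-hour mark easy to spot):
#   journalctl -u kubelet --no-pager -o short-iso | lease_failure_summary
```

The text that follows the message on the first matching line is usually the most useful part: an authorization-style error suggests credentials or IAM, while a timeout or connection error suggests networking.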

If these steps don't resolve the issue, it may be necessary to engage AWS support for more in-depth diagnostics, especially since the problem doesn't occur in EKS Auto Mode clusters. They can provide more detailed insights into the underlying causes and potential solutions specific to your cluster configuration.
Sources
Node in EKS Cluster has status Unknown | AWS re:Post
EKS node creation failed | AWS re:Post
Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters | AWS Machine Learning Blog

answered a year ago

Most likely IAM. Check which policies are attached to the node role:

aws iam list-attached-role-policies --role-name <node-role-name>

AWS
answered a year ago
  • That's well and good, but what should it have? My instance role has all the policies that are created with the EKS Auto Mode cluster.
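
One hedged way to approach that question is to compare the attached policies against an explicit list. The sketch below assumes the AWS CLI is configured; the function name missing_policies is hypothetical, and the policy names in the usage example (AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly) are the ones AWS has documented for managed node group roles as far as I know, so verify them against the current EKS node IAM role documentation. AmazonEKS_CNI_Policy may sit on the node role or on the role bound to the aws-node service account, which in the config above would be MyPodIdentityAccessRole.

```shell
# Hypothetical helper: read attached policy names on stdin (separated by
# tabs, spaces, or newlines) and print each expected name, given as an
# argument, that is not in the attached list.
missing_policies() {
  attached=$(tr -s '[:space:]' '\n')
  for want in "$@"; do
    if ! printf '%s\n' "$attached" | grep -qxF -- "$want"; then
      echo "missing: $want"
    fi
  done
}

# Example usage (role name taken from the question; the expected policy
# names are an assumption to check against the current AWS documentation):
#   aws iam list-attached-role-policies --role-name MyEKSNodeInstanceRole \
#     --query 'AttachedPolicies[].PolicyName' --output text \
#     | missing_policies AmazonEKSWorkerNodePolicy AmazonEC2ContainerRegistryReadOnly
```

No output means every expected policy is attached. That only rules out a missing managed policy, not other IAM or access-entry problems.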
