
How to fix AWS EKS "Instances failed to join the kubernetes cluster"? The Node is in Ready state on NodeGroup Page


The NodeGroup page shows "Instances failed to join the kubernetes cluster" even though the node is in Ready state. I have updated the aws-auth file, and the node is up and running, but I still see "Instances failed to join the kubernetes cluster" on the NodeGroup page. How can I fix it?

asked 10 months ago, 392 views
3 Answers

This issue where you see "Instances failed to join the kubernetes cluster" despite nodes appearing in Ready state can be confusing. Here's how to troubleshoot and resolve it:

First, verify that your nodes are truly joining the cluster by running:

kubectl get nodes

If your nodes appear in this list and show as Ready, then they are actually connected to the cluster despite the error message in the console. This suggests there might be a disconnect between the actual state and what's displayed in the AWS console.
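As a rough way to narrow this down, the sketch below counts the Ready nodes that belong to one specific node group. It assumes that nodes launched by a managed node group carry the label eks.amazonaws.com/nodegroup=<name>; the function name and the node group name passed to it are placeholders, not anything taken from the question.

```shell
# Count nodes of one managed node group whose STATUS is exactly "Ready".
# Assumption: managed node group nodes carry the label
# eks.amazonaws.com/nodegroup=<node group name>.
ready_nodes_in_group() {
  # $1 = node group name (placeholder)
  kubectl get nodes -l "eks.amazonaws.com/nodegroup=$1" --no-headers \
    | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}
```

Calling ready_nodes_in_group my-nodegroup prints a single number. If it prints 0 while kubectl get nodes still lists Ready nodes, those nodes may belong to a different node group or to EKS Auto Mode rather than to the group that reports the error.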

Common causes and solutions for this issue:

  1. UserData configuration issues: The bootstrap script in the UserData of the worker node must contain the correct EKS cluster name. If you're using a launch template, ensure the bootstrap script is properly configured with the correct cluster name.

  2. Launch template problems: If you're using a launch template and have deleted the reference template, the node group may show this error. In this case, you'll need to recreate the node group:

  • Launch a new node group
  • Verify all nodes are healthy and in Ready state
  • Drain your worker nodes from the old group
  • Delete the original node group
  3. Authentication issues: Even though the nodes appear to be running, there might be authentication issues between the nodes and the cluster. Ensure the node IAM role has the required permissions and, if the cluster uses the aws-auth ConfigMap, that the role is mapped there.

  4. Check for NodeClaims issues (if using EKS Auto Mode):

kubectl get nodeclaim
kubectl describe nodeclaim <node_claim>

Look under Status to find any issues preventing proper node registration.

  5. Console synchronization: Sometimes the AWS console doesn't immediately reflect the actual state of your cluster. Try refreshing the console after some time.
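The authentication check above can be spot-checked from the command line. This is a minimal sketch that assumes the cluster relies on the aws-auth ConfigMap in the kube-system namespace (clusters that use access entries only would need a different check); the function name and the role ARN are hypothetical.

```shell
# Succeed (exit status 0) if the given node role ARN appears in the
# aws-auth ConfigMap, fail otherwise.
# Assumption: the cluster uses the aws-auth ConfigMap for node authentication.
role_in_aws_auth() {
  # $1 = node IAM role ARN (placeholder)
  kubectl -n kube-system get configmap aws-auth -o yaml \
    | grep -F -q -- "$1"
}
```

For example: role_in_aws_auth "arn:aws:iam::111122223333:role/my-node-role" && echo "mapped" || echo "not mapped". The check is a plain text match, so it only confirms that the ARN is present, not that the surrounding mapRoles entry is well formed.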

Since your nodes are showing as Ready in kubectl, the issue might be primarily with the console display rather than actual functionality. If your workloads are running properly on these nodes, you might not need to take immediate action beyond ensuring your configuration is correct.
Sources
EKS node creation failed | AWS re:Post
Troubleshoot EKS Auto Mode - Amazon EKS
Troubleshoot Amazon EKS managed node groups and launch template issues | AWS re:Post

answered 10 months ago
  • 1. I am not using Launch template, I created the node group from EKS cluster UI in AWS Management Console
    2. The result of node claim seems pointing to another node:
    ~ $ kubectl get nodeclaim
    NAME           TYPE        CAPACITY    ZONE         NODE                  READY   AGE
    system-sv5h5   c6g.large   on-demand   us-east-1b   i-0af5c5b5c005a4d5c   True    9h
    3. All of my nodes:
    ~ $ kubectl get nodes
    NAME                  STATUS   ROLES    AGE     VERSION
    i-06cefaff7ebfe85fe   Ready    <none>   4d18h   v1.33.0-eks-802817d
    i-07124e88bba60bad2   Ready    <none>   2d      v1.33.0-eks-802817d
    i-08aea1a0299550e45   Ready    <none>   45h     v1.33.0-eks-802817d
    i-0af5c5b5c005a4d5c   Ready    <none>   9h      v1.33.1-eks-b9364f6


As mentioned in the documentation [1], when an EKS managed node group is created and its nodes fail to connect to the cluster within 15 minutes, EKS emits a "NodeCreationFailure" health issue and marks the node group status as "Create failed" in the console.

This initial failure occurs when the launched instances cannot register with your Amazon EKS cluster, for example because the node IAM role lacks permissions or because the nodes have no outbound access to the required endpoints or to the cluster (detailed in [2]).

Once the issues are fixed, the nodes may still join the cluster and reach Ready state, because kubelet keeps retrying to register. Note, however, that the node group remains in the "Create failed" state in the console even after the nodes join the cluster.
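To see what the node group itself reports, independent of the node state, the status and health issues can be read with the AWS CLI. The sketch below wraps aws eks describe-nodegroup; the function name and the cluster and node group names passed to it are placeholders.

```shell
# Print the status and recorded health issues of one managed node group.
nodegroup_health() {
  # $1 = cluster name, $2 = node group name (both placeholders)
  aws eks describe-nodegroup \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --query 'nodegroup.{status: status, issues: health.issues}' \
    --output json
}
```

Running nodegroup_health my-cluster my-nodegroup returns a small JSON object. A status of CREATE_FAILED together with a NodeCreationFailure issue would match the situation described above, even when kubectl get nodes shows the nodes as Ready.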

Since your nodes have successfully joined the cluster, the underlying issues appear to have been fixed. I would suggest creating a new node group, shifting your workloads to the new nodes, and then deleting the previous node group. To shift your workloads gracefully, follow the steps in the documentation [3].
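The workload shift can be sketched as a cordon-then-drain pass over the old node group. This assumes the old nodes carry the label eks.amazonaws.com/nodegroup=<name> and that evicting pods with emptyDir data is acceptable; the function name and node group name are placeholders, and the drain flags should be adjusted to the workloads involved.

```shell
# Cordon every node of the old node group first, so evicted pods cannot be
# rescheduled onto another old node, then drain the nodes one by one.
drain_nodegroup() {
  # $1 = name of the old node group (placeholder)
  nodes=$(kubectl get nodes -l "eks.amazonaws.com/nodegroup=$1" -o name \
    | sed 's|^node/||')
  for node in $nodes; do
    kubectl cordon "$node"
  done
  for node in $nodes; do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  done
}
```

After drain_nodegroup old-nodegroup completes and the pods are running on the new nodes, the old node group can be deleted.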

References:
[1] https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html
[2] https://repost.aws/knowledge-center/eks-worker-nodes-cluster
[3] https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

AWS
answered 10 months ago
  • Recreating the node group doesn't help. The new node group is still in the Create failed state.

  1. I am not using Launch template, I created the node group from EKS cluster UI in AWS Management Console
  2. The result of node claim seems pointing to another node:
     ~ $ kubectl get nodeclaim
     NAME           TYPE        CAPACITY    ZONE         NODE                  READY   AGE
     system-sv5h5   c6g.large   on-demand   us-east-1b   i-0af5c5b5c005a4d5c   True    9h
  3. All of my nodes:
     ~ $ kubectl get nodes
     NAME                  STATUS   ROLES    AGE     VERSION
     i-06cefaff7ebfe85fe   Ready    <none>   4d18h   v1.33.0-eks-802817d
     i-07124e88bba60bad2   Ready    <none>   2d      v1.33.0-eks-802817d
     i-08aea1a0299550e45   Ready    <none>   45h     v1.33.0-eks-802817d
     i-0af5c5b5c005a4d5c   Ready    <none>   9h      v1.33.1-eks-b9364f6
answered 10 months ago
