NodeCreationFailure - Instances failed to join the kubernetes cluster


Generic problem with managed node group instances getting created but not being able to join the cluster. Even choosing the default VPC, which has less restrictive security groups/NACLs, results in the same issue. My cluster endpoint is both public and private, and every time, after 25 minutes or so, node group provisioning fails with the error "NodeCreationFailure - Instances failed to join the kubernetes cluster". I have checked all the troubleshooting steps mentioned in the guide with respect to DNS, roles, security groups, etc. Any insights would be greatly appreciated.

SM
asked 2 years ago · 1390 views
3 Answers

Hello SM,

NodeCreationFailure can happen for a number of reasons. Here is a troubleshooting guide that walks through the various things to look at.

You can start by looking at the kubelet logs on the node instance to find out what caused the node bootstrapping failure. You can run the following command to view the logs:

journalctl -f -u kubelet

If you see any network timeout errors in the kubelet logs, it could mean that there is a networking issue. Also, check whether the UserData script is actually invoking the bootstrap.sh script during node startup.
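As a quick check, you can dump the instance user data from the node itself to confirm the bootstrap.sh call is present (a minimal sketch; with IMDSv2 enforced you need to fetch a session token first):

# IMDSv2: fetch a token, then read the instance user data
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/user-data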

Check the network connectivity between the API server and the node instance by running the following command on the node instance. An unauthenticated request should return an HTTP 403, which confirms the endpoint is reachable.

curl -Ivk <API-Server-URL>
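If the curl command hangs instead of returning a 403, you can narrow it down to basic TCP reachability on port 443 (a sketch; it assumes nc is available on the node, and the host is a placeholder for your cluster endpoint):

# assumes nc (nmap-ncat) is installed on the node
nc -zv -w 5 <API-Server-Host> 443

A timeout here points to a security group, NACL, or routing problem rather than a kubelet issue.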

If you are still facing issues finding the root cause after going through the troubleshooting guide, please feel free to reach out to AWS Support for assistance.

Thank you!

AWS
SUPPORT ENGINEER
answered 2 years ago

Thanks for the inputs. Unfortunately, even though I have brought up my managed node group instances in a public subnet for testing purposes, and the instance does have a public IP with SSH access allowed via the security group, I am still not able to log in to the instance; the connection times out. I also tried telnet to the public IP on port 22, since I had it open to the world for testing, and even telnet timed out, which leads me to think that the SSH daemon may not be up on the node group instance. On the other hand, I was able to check the system logs from the console (a few details redacted below), which does confirm that the bootstrap.sh script is being invoked.

[ 31.516263] cloud-init[2540]: Cloud-init v. 19.3-45.amzn2 running 'modules:final' at Thu, 30 Jun 2022 20:13:50 +0000. Up 31.43 seconds.
[ 31.533559] cloud-init[2540]: + B64_CLUSTER_CA=abcd123
[ 31.566492] cloud-init[2540]: + API_SERVER_URL=https://abcd123.gr7.us-east-1.eks.amazonaws.com
[ 31.566850] cloud-init[2540]: + K8S_CLUSTER_DNS_IP=abcd123
[ 31.567077] cloud-init[2540]: + /etc/eks/bootstrap.sh tst-demo-eks --kubelet-extra-args --node-labels=eks.amazonaws.com/nodegroup-image=ami-049924d678af7a43b,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=tmp-ng --b64-cluster-ca abcd123 --apiserver-endpoint https://abdc123.gr7.us-east-1.eks.amazonaws.com --dns-cluster-ip abcd123
[ 31.724475] cloud-init[2540]: '/etc/eks/iptables-restore.service' -> '/etc/systemd/system/iptables-restore.service'

SM
answered 2 years ago

Hello again,

You can try to log in to the instance using Session Manager. The EKS optimized AMI comes with the SSM agent pre-installed, and all you need to do is add the AmazonSSMManagedInstanceCore IAM policy to your node group IAM role. Please see here for instructions.
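For example, with the AWS CLI you could attach the policy and then open a session (a sketch; the role name and instance ID are placeholders, and the Session Manager plugin must be installed on your workstation):

# <node-group-role-name> and <instance-id> are placeholders for your own values
aws iam attach-role-policy --role-name <node-group-role-name> --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws ssm start-session --target <instance-id>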

Also, make sure that your NACL rules allow inbound access on port 22 to your public subnet. Keep in mind that NACLs are stateless, so the outbound ephemeral port range (1024-65535) must be allowed as well for the return traffic.
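One way to verify this is to list the NACL associated with the node subnet and inspect its entries (assuming the AWS CLI is configured; the subnet ID is a placeholder):

# <subnet-id> is a placeholder for your node subnet
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=<subnet-id>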

If none of this works, please reach out to AWS Support and we will be glad to assist you!

AWS
SUPPORT ENGINEER
answered 2 years ago
