The issue you're experiencing with nodes failing to join the EKS cluster could be due to several factors. Here are some potential causes and solutions:
- DHCP Options: Ensure that your VPC has the correct DHCP options set. Your script creates DHCP options, but make sure they include both domain-name and domain-name-servers. The domain-name should be the region-specific value: ec2.internal for us-east-1, or <region>.compute.internal for any other region.
- DNS Resolution: Verify that DNS hostnames and DNS resolution are enabled for your VPC. Your script enables these, but double-check that they are actually set.
- Security Group Rules: Review your security group configurations. Ensure that the node security group allows all traffic from the cluster security group and from itself. Your script seems to set this up correctly, but verify that these rules are in place.
- VPC Endpoints: Confirm that you have the necessary VPC endpoints set up correctly. Your script creates endpoints for EKS, ECR API, and ECR DKR, which is good. Make sure these endpoints are associated with the correct subnets and security groups.
- IAM Roles: Verify that the IAM roles for both the cluster and nodes have the correct permissions. Your script attaches the necessary policies, but ensure they have propagated before creating the cluster and nodes.
- Subnet Configuration: Check that your subnets are properly tagged for EKS use. Your script adds the required tags, but verify they are correct.
- Bootstrap Script: Although not explicitly shown in your script, ensure that the node UserData includes the correct bootstrap script with the proper cluster name.
- Network Connectivity: Verify that the nodes can reach the cluster API server. Since you're using a private cluster, ensure that the VPC endpoints are correctly set up to allow this communication.
- AWS CLI Version: Make sure you're using the latest version of the AWS CLI, as older versions might not support all the required options for EKS.
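The DHCP, DNS, and endpoint checks above can be run from the CLI rather than inspected by eye. The following is a minimal sketch, not part of the original script: the function names and the placeholder VPC ID are assumptions for illustration, and it only reads configuration without changing anything.

```shell
#!/bin/sh
# Sketch only: function names and IDs are hypothetical placeholders.

# Print the domain-name a VPC's DHCP option set is expected to carry.
# us-east-1 uses ec2.internal; other regions use <region>.compute.internal.
expected_domain_name() {
  if [ "$1" = "us-east-1" ]; then
    echo "ec2.internal"
  else
    echo "$1.compute.internal"
  fi
}

# Print both DNS attributes of a VPC; each should report True.
check_vpc_dns() {
  vpc_id="$1"
  echo "enableDnsSupport: $(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsSupport --query 'EnableDnsSupport.Value' --output text)"
  echo "enableDnsHostnames: $(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames --query 'EnableDnsHostnames.Value' --output text)"
}

# Print the DHCP option set attached to a VPC so that domain-name and
# domain-name-servers can be compared against the expected values.
show_dhcp_options() {
  vpc_id="$1"
  dhcp_id=$(aws ec2 describe-vpcs --vpc-ids "$vpc_id" \
    --query 'Vpcs[0].DhcpOptionsId' --output text)
  aws ec2 describe-dhcp-options --dhcp-options-ids "$dhcp_id" \
    --query 'DhcpOptions[0].DhcpConfigurations' --output json
}

# List the service name and state of every endpoint in a VPC.
list_vpc_endpoints() {
  aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].[ServiceName,State]' --output text
}

# Example usage (placeholder IDs):
#   expected_domain_name eu-west-1
#   check_vpc_dns vpc-0123456789abcdef0
#   show_dhcp_options vpc-0123456789abcdef0
#   list_vpc_endpoints vpc-0123456789abcdef0
```

Comparing the output of show_dhcp_options with expected_domain_name for your region is a quick way to rule out the DHCP option set as the cause.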
To troubleshoot further:
- Check the EC2 instance logs for the failed nodes. You can access these through the EC2 console or using Systems Manager Session Manager if configured.
- Review the CloudWatch logs for the EKS control plane if you have logging enabled.
- Try to manually join a node to the cluster to see if you can identify any specific errors during the join process.
- Consider enabling EKS control plane logging to get more detailed information about cluster operations.
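As a rough sketch of two of these steps, the functions below read the health issues EKS records for a node group and turn on control plane logging. This assumes a managed node group (the original script is not shown here), and the function names and the cluster and node group names in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch only: function names and arguments are hypothetical placeholders.

# Print the health issues EKS reports for a managed node group.
# A failed join usually shows up here with an issue code and message.
show_nodegroup_issues() {
  aws eks describe-nodegroup --cluster-name "$1" --nodegroup-name "$2" \
    --query 'nodegroup.health.issues' --output json
}

# Enable all five control plane log types for a cluster.
# Logs are delivered to the CloudWatch log group /aws/eks/<cluster-name>/cluster.
enable_control_plane_logging() {
  aws eks update-cluster-config --name "$1" \
    --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
}

# Example usage (placeholder names):
#   show_nodegroup_issues my-cluster my-nodegroup
#   enable_control_plane_logging my-cluster
```

The authenticator log is often the most useful one for join failures, since it records whether the node's IAM role was recognized.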
If the issue persists after checking these areas, you may need to engage AWS support for further assistance, providing them with the detailed logs and the steps you've taken to troubleshoot.
Sources
Adding Nodes in EKS getting >>nodecreationfailure eks - Nodes instances failed to join the kubernetes cluster | AWS re:Post
EKS node creation failed | AWS re:Post
Using custom AMI ID for EKS Nodegroup | AWS re:Post
Troubleshoot problems with Amazon EKS clusters and nodes - Amazon EKS
From a quick look, the script appears fine, except:
- the security group's allowed port seems to be limited to ICMP (-1), I think. For --port, -1 only means "all types" with ICMP; for TCP/UDP you might want to use --port 0-65535 (with --protocol -1, all ports are allowed regardless of the port you specify). https://docs.aws.amazon.com/cli/latest/reference/ec2/authorize-security-group-ingress.html
- and with the above fixed, if the default outbound rule is not restricted, the cluster DNS needs should already be covered by your node-to-node rule (port 53 UDP/TCP) and your node-to-cluster rule (port 443 for getting cluster DNS information), so these hardcoded CIDR rules might be redundant:
# Node SG: Allow DNS (TCP/UDP 53) within VPC
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol tcp --port 53 --cidr 10.1.0.0/16
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol udp --port 53 --cidr 10.1.0.0/16
- checking why the instances are failing to join the cluster is the most precise troubleshooting step:
- you may want to check the EC2 console logs available to you - https://docs.aws.amazon.com/cli/latest/reference/ec2/get-console-output.html
- more useful, you should look to configure ssm/ssh for initial standup troubleshooting and check the
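The security group and console-log checks above could be scripted roughly as follows. This is a sketch under assumptions: the function names and the group and instance IDs are placeholders, not values from the original script. It uses the --ip-permissions form so that the protocol (-1 for all traffic) is stated explicitly instead of relying on how --port is interpreted.

```shell
#!/bin/sh
# Sketch only: function names and IDs are hypothetical placeholders.

# List a security group's rules: direction, protocol, port range, and source.
# An all-traffic rule appears with IpProtocol -1.
show_sg_rules() {
  aws ec2 describe-security-group-rules \
    --filters "Name=group-id,Values=$1" \
    --query 'SecurityGroupRules[].[IsEgress,IpProtocol,FromPort,ToPort,CidrIpv4,ReferencedGroupInfo.GroupId]' \
    --output table
}

# Allow all protocols and ports into group $1 from source group $2.
allow_all_from_group() {
  aws ec2 authorize-security-group-ingress --group-id "$1" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$2}]"
}

# Print the console output of an instance that failed to join.
show_console_output() {
  aws ec2 get-console-output --instance-id "$1" \
    --query 'Output' --output text
}

# Example usage (placeholder IDs):
#   show_sg_rules "$NODE_SG_ID"
#   allow_all_from_group "$NODE_SG_ID" "$CLUSTER_SG_ID"
#   show_console_output i-0123456789abcdef0
```

Running show_sg_rules first shows whether the existing rule really is ICMP-only before anything is changed.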
