
How do I troubleshoot Amazon EKS managed node group creation failures?


My Amazon Elastic Kubernetes Service (Amazon EKS) managed node group failed to create. Nodes can't join the cluster, and I received the "Instances failed to join the kubernetes cluster" error.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Run the AWSSupport-TroubleshootEKSWorkerNode automation runbook

Prerequisites: Your worker nodes must have permission to access AWS Systems Manager, and the Systems Manager agent (SSM Agent) must be running on them. To grant permissions, use the AmazonSSMManagedInstanceCore AWS managed policy. Attach the policy to the AWS Identity and Access Management (IAM) role that corresponds to your Amazon Elastic Compute Cloud (Amazon EC2) instance profile. For more information, see the To add instance profile permissions for Systems Manager to an existing role (console) section of Alternative configuration for EC2 instance permissions.
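
For example, you can attach the policy to the node IAM role with the AWS CLI. In the following sketch, example-NodeInstanceRole is a placeholder for the IAM role that your node group's instance profile uses:

# Attach the AmazonSSMManagedInstanceCore managed policy to the node IAM role
aws iam attach-role-policy \
  --role-name example-NodeInstanceRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore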

To use the AWSSupport-TroubleshootEKSWorkerNode runbook to troubleshoot issues, complete the following steps:

  1. Open the runbook.
  2. Make sure that the AWS Region in the AWS Management Console is the same as your Amazon EKS cluster's Region.
    Note: Review the Runbook details section of the runbook for more information.
  3. In the Input parameters section, enter the name of your cluster for ClusterName and your instance ID for WorkerID.
  4. (Optional) For AutomationAssumeRole, select the IAM role to allow Systems Manager to perform actions. If you don't specify a role, then Systems Manager uses your current IAM entity's permissions to perform the actions in the runbook.
  5. Choose Execute.
  6. Check the Outputs to identify why your worker node can't join your cluster and the steps that you can take to resolve the error.
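
You can also start the automation from the AWS CLI instead of the console. In the following sketch, the cluster name and instance ID are placeholders:

aws ssm start-automation-execution \
  --document-name "AWSSupport-TroubleshootEKSWorkerNode" \
  --parameters "ClusterName=example-clustername,WorkerID=i-1234567890abcdef0"

Then pass the returned execution ID to the aws ssm get-automation-execution command to review the outputs.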

Check your worker node security group traffic requirements

Confirm that you configured your control plane's security group and worker node security group with the requirements for inbound and outbound traffic. By default, Amazon EKS applies the cluster security group to the instances in your node group to facilitate communication between nodes and the control plane. If you specify custom security groups in the launch template for your managed node group, then Amazon EKS doesn't add the cluster security group.
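
To find the cluster security group that Amazon EKS applies, you can run a command similar to the following sketch, where example-clustername is a placeholder:

aws eks describe-cluster \
  --name example-clustername \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" \
  --output text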

Check your worker node's IAM permissions

Verify that you attached the AmazonEKSWorkerNodePolicy and AmazonEC2ContainerRegistryReadOnly policies to the instance IAM role that you associated with your worker node.

Important: It's a best practice to attach the AmazonEKS_CNI_Policy to an IAM role that's associated with the aws-node Kubernetes service account in the kube-system namespace. However, you can attach the policy to the node instance role instead, if needed.
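
To list the managed policies that are attached to the node IAM role, you can run a command similar to the following sketch. The role name example-NodeInstanceRole is a placeholder:

aws iam list-attached-role-policies \
  --role-name example-NodeInstanceRole \
  --query "AttachedPolicies[*].PolicyName" \
  --output text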

Confirm that the Amazon VPC for your cluster has support for a DNS hostname and resolution

After you configure private access for your Amazon EKS cluster endpoint, activate DNS hostnames and DNS resolution for your Amazon Virtual Private Cloud (Amazon VPC). When you activate endpoint private access, Amazon EKS creates an Amazon Route 53 private hosted zone, and then associates it with your cluster's Amazon VPC. For more information, see Cluster API server endpoint.
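
To confirm the VPC attributes, you can run commands similar to the following sketch. The VPC ID is a placeholder. If either attribute returns false, turn it on with the modify-vpc-attribute command or in the Amazon VPC console:

aws ec2 describe-vpc-attribute --vpc-id vpc-1234567890abcdef0 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-1234567890abcdef0 --attribute enableDnsHostnames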

Update the aws-auth ConfigMap with your worker nodes' NodeInstanceRole

Verify that you correctly configured the aws-auth ConfigMap with your worker nodes' IAM role instead of the instance profile.
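
For example, you can inspect the ConfigMap with the kubectl describe configmap aws-auth -n kube-system command. A correctly configured mapRoles entry references the node IAM role ARN, similar to the following sketch, where the account ID and role name are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/example-NodeInstanceRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes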

Set the tags for your worker nodes

For the Tag property of your worker nodes, set the key to kubernetes.io/cluster/clusterName and the value to owned. Replace clusterName with your cluster's name.
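
For example, you can add the tag to an existing worker node with the AWS CLI. The instance ID and cluster name in the following sketch are placeholders:

aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=kubernetes.io/cluster/example-clustername,Value=owned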

Confirm that the Amazon VPC subnets for the worker node have available IP addresses

If your Amazon VPC runs out of IP addresses, then associate a secondary Classless Inter-Domain Routing (CIDR) block with your existing Amazon VPC. For more information, see View Amazon EKS networking requirements for VPC and subnets.
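
To check how many IP addresses remain in the node group's subnets, you can run a command similar to the following sketch. The subnet IDs are placeholders:

aws ec2 describe-subnets \
  --subnet-ids subnet-1234567890abcdef0 subnet-0abcdef1234567890 \
  --query "Subnets[*].{SubnetId: SubnetId, AvailableIpAddressCount: AvailableIpAddressCount}" \
  --output table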

Confirm that your Amazon EKS worker nodes can reach the API server endpoint for your cluster

You can launch worker nodes in any subnet within your cluster's VPC or a peered subnet, as long as there's an internet route through one of the following gateways:

  • NAT
  • Internet
  • Transit

If you launched your worker nodes in a restricted private network, then confirm that your worker nodes can reach the Amazon EKS API server endpoint. Make sure that you meet the requirements to run Amazon EKS in a private cluster without outbound internet access.
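
For a private cluster, you can list the VPC endpoints in the cluster's VPC to confirm that the required AWS services are reachable without outbound internet access. The VPC ID in the following sketch is a placeholder:

aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-1234567890abcdef0 \
  --query "VpcEndpoints[*].ServiceName" \
  --output text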

Note: You might have nodes in a private subnet that's backed by a NAT gateway. In this scenario, it's a best practice to create the NAT gateway in a public subnet.

If you don't use AWS PrivateLink endpoints, then verify access to API endpoints through a proxy server for the following AWS services:

  • Amazon EC2
  • Amazon Elastic Container Registry (Amazon ECR)
  • Amazon Simple Storage Service (Amazon S3)

To verify that the worker node has access to the API server, use SSH to connect, and then run the following netcat command:

nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443

Note: Replace 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com with your API server endpoint.

To check the kubelet logs before you disconnect from your instance, run the following command:

journalctl -f -u kubelet

If the kubelet logs don't provide information about the source of the issue, then run the following command to check the worker node's kubelet status:

sudo systemctl status kubelet

Review your Amazon EKS logs and the operating system (OS) logs for additional troubleshooting steps.
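
The exact log locations depend on the AMI, but on Amazon Linux worker nodes you can usually start with commands similar to the following sketch:

# Kubelet and container runtime status
sudo systemctl status kubelet containerd
# Bootstrap output from cloud-init, including any node join errors
sudo cat /var/log/cloud-init-output.log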

Verify that the API endpoints can reach your Region

Use SSH to connect to one of the worker nodes, and then run the following commands for each service:

  • Amazon EC2

    nc -vz ec2.example-region.amazonaws.com 443
  • Amazon ECR

    nc -vz ecr.example-region.amazonaws.com 443
  • Amazon S3

    nc -vz s3.example-region.amazonaws.com 443

Note: Replace example-region with the Region for your worker node.

Configure the user data for your worker node

For managed node group launch templates with a specified Amazon Machine Image (AMI), you must supply bootstrap commands for worker nodes to join your cluster. Amazon EKS doesn't merge the default bootstrap commands into your user data. For more information, see Introducing launch template and custom AMI support in Amazon EKS managed node groups.

To configure user data, complete the following steps:

  1. Run the following describe-cluster AWS CLI command to retrieve the necessary data:
    aws eks describe-cluster --name example-clustername --query "cluster.{name: name, endpoint: endpoint, certAuth: certificateAuthority.data, serviceIpv4Cidr: kubernetesNetworkConfig.serviceIpv4Cidr}"
    Note: Replace example-clustername with your cluster's name.
  2. In the output, note your cluster's API server endpoint, certificate authority, and service CIDR.
  3. Add the following configuration to your user data. For instructions on how to add the configuration, see the Amazon Linux 2023 user data section of Amazon EC2 User Data:
    ---
    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      cluster:
        name: example-clustername
        apiServerEndpoint: example-api-server-endpoint
        certificateAuthority: example-certificate-authority
        cidr: example-service-cidr
    ---
    Note: Replace example-clustername with your cluster's name, example-api-server-endpoint with your cluster's API server endpoint, and example-certificate-authority with your cluster's certificate authority data. Also, replace example-service-cidr with your cluster's service CIDR.
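
When you pass this configuration through a launch template's user data, the NodeConfig document from step 3 is typically embedded in a MIME multi-part message with the application/node.eks.aws content type. The following is a sketch; the boundary string and all example values are placeholders:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: application/node.eks.aws

---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: example-clustername
    apiServerEndpoint: example-api-server-endpoint
    certificateAuthority: example-certificate-authority
    cidr: example-service-cidr

--BOUNDARY--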

Related information

Troubleshoot problems with Amazon EKS clusters and nodes

How do I get my worker nodes to join my Amazon EKS cluster?