
NodeCreationFailure: Instances failed to join the kubernetes cluster


I am struggling badly with creating a private cluster in EKS. I have tried almost everything, including help from ChatGPT and Grok, but the issue is still not resolved. Could you please guide me on what the issue is in my script?

#!/bin/bash
set -e  # Exit on command failure

# Configuration
AWS_REGION="us-east-1"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
VPC_NAME="new-eks-vpc"
CLUSTER_NAME="new-eks-cluster"
NODEGROUP_NAME="new-eks-workers"
CLUSTER_SG_NAME="new-eks-cluster-sg"
NODE_SG_NAME="new-eks-node-sg"
ENDPOINT_SG_NAME="new-eks-endpoint-sg"
echo "AWS Account: $AWS_ACCOUNT_ID | Region: $AWS_REGION"

# 1. IAM Roles
echo "Creating IAM roles..."

# EKS Cluster Role
EKS_CLUSTER_ROLE="NewEKSClusterRole"
if aws iam get-role --role-name "$EKS_CLUSTER_ROLE" >/dev/null 2>&1; then
    echo "$EKS_CLUSTER_ROLE already exists."
else
    aws iam create-role --role-name "$EKS_CLUSTER_ROLE" --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"eks.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
    aws iam attach-role-policy --role-name "$EKS_CLUSTER_ROLE" --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
fi

# EKS Node Role
EKS_NODE_ROLE="NewEKSNodeRole"
if aws iam get-role --role-name "$EKS_NODE_ROLE" >/dev/null 2>&1; then
    echo "$EKS_NODE_ROLE already exists."
else
    aws iam create-role --role-name "$EKS_NODE_ROLE" --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
    aws iam attach-role-policy --role-name "$EKS_NODE_ROLE" --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
    aws iam attach-role-policy --role-name "$EKS_NODE_ROLE" --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
    aws iam attach-role-policy --role-name "$EKS_NODE_ROLE" --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
fi
sleep 10  # Wait for IAM propagation

# 2. VPC & Networking
echo "Setting up VPC..."

VPC_ID=$(aws ec2 create-vpc --cidr-block 10.1.0.0/16 --query 'Vpc.VpcId' --output text)
aws ec2 create-tags --resources "$VPC_ID" --tags Key=Name,Value="$VPC_NAME"
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --enable-dns-support
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --enable-dns-hostnames

# DHCP Options with AmazonProvidedDNS
DHCP_OPTIONS_ID=$(aws ec2 create-dhcp-options --dhcp-configuration "Key=domain-name-servers,Values=AmazonProvidedDNS" --query 'DhcpOptions.DhcpOptionsId' --output text)
aws ec2 associate-dhcp-options --dhcp-options-id "$DHCP_OPTIONS_ID" --vpc-id "$VPC_ID"

# Subnets (two private subnets for HA)
SUBNET1=$(aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block 10.1.1.0/24 --availability-zone "us-east-1a" --query 'Subnet.SubnetId' --output text)
aws ec2 create-tags --resources "$SUBNET1" --tags Key=Name,Value=new-eks-subnet-1 "Key=kubernetes.io/role/internal-elb,Value=1" "Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=shared"

SUBNET2=$(aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block 10.1.2.0/24 --availability-zone "us-east-1b" --query 'Subnet.SubnetId' --output text)
aws ec2 create-tags --resources "$SUBNET2" --tags Key=Name,Value=new-eks-subnet-2 "Key=kubernetes.io/role/internal-elb,Value=1" "Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=shared"

# Route Table
RTB_ID=$(aws ec2 create-route-table --vpc-id "$VPC_ID" --query 'RouteTable.RouteTableId' --output text)
aws ec2 associate-route-table --route-table-id "$RTB_ID" --subnet-id "$SUBNET1"
aws ec2 associate-route-table --route-table-id "$RTB_ID" --subnet-id "$SUBNET2"

# S3 Gateway Endpoint
echo "Creating S3 Gateway Endpoint..."
S3_ENDPOINT_ID=$(aws ec2 create-vpc-endpoint \
  --vpc-id "$VPC_ID" \
  --service-name "com.amazonaws.$AWS_REGION.s3" \
  --route-table-ids "$RTB_ID" \
  --vpc-endpoint-type Gateway \
  --query 'VpcEndpoint.VpcEndpointId' --output text)

# Security Groups
echo "Creating security groups..."

# Cluster Security Group
CLUSTER_SG_ID=$(aws ec2 create-security-group --group-name "$CLUSTER_SG_NAME" --description "New EKS Cluster SG" --vpc-id "$VPC_ID" --query 'GroupId' --output text)

# Node Security Group
NODE_SG_ID=$(aws ec2 create-security-group --group-name "$NODE_SG_NAME" --description "New EKS Node SG" --vpc-id "$VPC_ID" --query 'GroupId' --output text)

# Endpoint Security Group
ENDPOINT_SG_ID=$(aws ec2 create-security-group --group-name "$ENDPOINT_SG_NAME" --description "New VPC Endpoint SG" --vpc-id "$VPC_ID" --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$ENDPOINT_SG_ID" --protocol tcp --port 443 --source-group "$NODE_SG_ID"

# Security Group Rules
echo "Configuring security group rules..."

# Cluster SG: Allow all from Node SG
aws ec2 authorize-security-group-ingress --group-id "$CLUSTER_SG_ID" --protocol all --port -1 --source-group "$NODE_SG_ID"

# Node SG: Allow all from Cluster SG
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol all --port -1 --source-group "$CLUSTER_SG_ID"

# Node SG: Allow self-referential traffic for node-to-node communication
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol all --port -1 --source-group "$NODE_SG_ID"

# Node SG: Allow DNS (TCP/UDP 53) within VPC
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol tcp --port 53 --cidr 10.1.0.0/16
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol udp --port 53 --cidr 10.1.0.0/16

# VPC Endpoints
echo "Creating VPC Endpoints..."
aws ec2 create-vpc-endpoint --vpc-id "$VPC_ID" --service-name "com.amazonaws.$AWS_REGION.eks" --vpc-endpoint-type Interface --subnet-ids "$SUBNET1" "$SUBNET2" --security-group-ids "$ENDPOINT_SG_ID" --private-dns-enabled
aws ec2 create-vpc-endpoint --vpc-id "$VPC_ID" --service-name "com.amazonaws.$AWS_REGION.ecr.api" --vpc-endpoint-type Interface --subnet-ids "$SUBNET1" "$SUBNET2" --security-group-ids "$ENDPOINT_SG_ID" --private-dns-enabled
aws ec2 create-vpc-endpoint --vpc-id "$VPC_ID" --service-name "com.amazonaws.$AWS_REGION.ecr.dkr" --vpc-endpoint-type Interface --subnet-ids "$SUBNET1" "$SUBNET2" --security-group-ids "$ENDPOINT_SG_ID" --private-dns-enabled

# 3. EKS Cluster & Node Group
echo "Creating EKS cluster: $CLUSTER_NAME..."
aws eks create-cluster --name "$CLUSTER_NAME" --role-arn "arn:aws:iam::$AWS_ACCOUNT_ID:role/$EKS_CLUSTER_ROLE" --resources-vpc-config "subnetIds=$SUBNET1,$SUBNET2,securityGroupIds=$CLUSTER_SG_ID,endpointPublicAccess=false,endpointPrivateAccess=true" --region "$AWS_REGION"
aws eks wait cluster-active --name "$CLUSTER_NAME" --region "$AWS_REGION" || { echo "Cluster creation failed."; exit 1; }

echo "Creating node group: $NODEGROUP_NAME..."
aws eks create-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODEGROUP_NAME" --subnets "$SUBNET1" "$SUBNET2" --node-role "arn:aws:iam::$AWS_ACCOUNT_ID:role/$EKS_NODE_ROLE" --scaling-config minSize=1,maxSize=3,desiredSize=2 --instance-types t3.medium --region "$AWS_REGION"
aws eks wait nodegroup-active --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODEGROUP_NAME" --region "$AWS_REGION" || {
    echo "Node group failed. Check issues:"
    aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODEGROUP_NAME" --query 'nodegroup.health.issues'
    exit 1
}

echo "Setup complete! Cluster and nodes are active."

Below is the continuous error I am getting, irrespective of all the steps. I am quite new, so I am not sure what mistake I am making.

Waiter NodegroupActive failed: Waiter encountered a terminal failure state: For expression "nodegroup.status" we matched expected path: "CREATE_FAILED"
Node group failed. Check issues:
[
    {
        "code": "NodeCreationFailure",
        "message": "Instances failed to join the kubernetes cluster",
        "resourceIds": [
            "i-09d1cb898f3c82fe1",
            "i-0ce35ab379f627cf9"
        ]
    }
]

2 Answers

The issue you're experiencing with nodes failing to join the EKS cluster could be due to several factors. Here are some potential causes and solutions:

  1. DHCP Options: Ensure that your VPC has the correct DHCP options set. Your script creates a DHCP option set, but make sure it includes both domain-name and domain-name-servers. The domain-name should be the region-specific value (ec2.internal for us-east-1, <region>.compute.internal elsewhere). A quick way to verify this is shown in the sketch after this list.

  2. DNS Resolution: Verify that DNS hostnames and DNS resolution are enabled for your VPC. Your script enables these, but double-check to ensure they're properly set.

  3. Security Group Rules: Review your security group configurations. Ensure that the node security group allows all traffic from the cluster security group and itself. Your script seems to set this up correctly, but verify that these rules are in place.

  4. VPC Endpoints: Confirm that you have the necessary VPC endpoints set up correctly. Your script creates endpoints for EKS, ECR API, and ECR DKR, which is good. Make sure these endpoints are associated with the correct subnets and security groups.

  5. IAM Roles: Verify that the IAM roles for both the cluster and nodes have the correct permissions. Your script attaches the necessary policies, but ensure they are properly propagated before creating the cluster and nodes.

  6. Subnet Configuration: Check that your subnets are properly tagged for EKS use. Your script adds the required tags, but verify they are correct.

  7. Bootstrap Script: With a managed node group on the default AMI, EKS supplies the bootstrap UserData automatically; if you move to a launch template or custom AMI, ensure the UserData runs the bootstrap script with the correct cluster name.

  8. Network Connectivity: Verify that the nodes can reach the cluster API server. Since you're using a private cluster, ensure that the VPC endpoints are correctly set up to allow this communication.

  9. AWS CLI Version: Make sure you're using the latest version of the AWS CLI, as older versions might not support all the required options for EKS.
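
For items 1 and 2, the sketch below is one quick way to verify the VPC's DNS attributes and DHCP options from the CLI. It is only a verification sketch, and it assumes the VPC_ID and DHCP_OPTIONS_ID variables from your script are still set in the shell:

# Both attributes should come back with "Value": true
aws ec2 describe-vpc-attribute --vpc-id "$VPC_ID" --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id "$VPC_ID" --attribute enableDnsHostnames

# The DHCP option set should include AmazonProvidedDNS (and ideally the regional
# domain-name, e.g. ec2.internal in us-east-1)
aws ec2 describe-dhcp-options --dhcp-options-ids "$DHCP_OPTIONS_ID" \
  --query 'DhcpOptions[0].DhcpConfigurations'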

To troubleshoot further:

  1. Check the EC2 instance logs for the failed nodes. You can access these through the EC2 console or using Systems Manager Session Manager if configured.

  2. Review the CloudWatch logs for the EKS control plane if you have logging enabled.

  3. Try to manually join a node to the cluster to see if you can identify any specific errors during the join process.

  4. Consider enabling EKS control plane logging to get more detailed information about cluster operations (a CLI sketch for this, and for pulling the failed instances' console output, follows this list).
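
The sketch below shows one way to do steps 1 and 4 from the CLI. It is only a sketch: it assumes the CLUSTER_NAME and AWS_REGION variables from your script, and uses one of the instance IDs from your error output as an example:

# Enable all control plane log types (api, audit, authenticator,
# controllerManager, scheduler) so join failures show up in CloudWatch Logs
aws eks update-cluster-config --name "$CLUSTER_NAME" --region "$AWS_REGION" \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# Pull the console output of one of the failed instances to look for
# kubelet/bootstrap errors
aws ec2 get-console-output --instance-id i-09d1cb898f3c82fe1 \
  --region "$AWS_REGION" --latest --output text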

If the issue persists after checking these areas, you may need to engage AWS support for further assistance, providing them with the detailed logs and the steps you've taken to troubleshoot.

answered 8 months ago
EXPERT
reviewed 8 months ago

On a quick look, the script appears fine, except:


  • and, the above being the case, if the default outbound rule is not restricted, the cluster DNS requirement should already be covered by your node-to-node rules (port 53 TCP/UDP) and your node-to-cluster rules (port 443, used to reach the cluster for DNS information), so this hardcoded CIDR might be redundant (see the check after the quoted rules below):
# Node SG: Allow DNS (TCP/UDP 53) within VPC
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol tcp --port 53 --cidr 10.1.0.0/16
aws ec2 authorize-security-group-ingress --group-id "$NODE_SG_ID" --protocol udp --port 53 --cidr 10.1.0.0/16
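
If you want to confirm that the default allow-all egress rule is still in place on the node security group (the case described above), something like this works; it is just a quick check, assuming the NODE_SG_ID variable from the script is still set:

# Show only the egress rules on the node SG; the default rule allows all
# protocols ("-1") to 0.0.0.0/0
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values="$NODE_SG_ID" \
  --query 'SecurityGroupRules[?IsEgress==`true`]'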

answered 8 months ago
