
Self-Managed Node Groups Not Joining EKS Cluster (CoreDNS 'DEGRADED' Error)


I am attempting to create self-managed node groups to launch EC2 instances using the Amazon Linux 2023 EKS-optimized AMI. However, I am encountering an issue where the node groups are not joining the cluster, which results in a 'DEGRADED' error for CoreDNS.

When I use the same Terraform code and eks module to create an EKS cluster with managed node groups, it works perfectly, with no issues related to node joining or CoreDNS.

This appears to be a bug. Is there a workaround to resolve this problem by modifying the Terraform code? Any suggestions or advice would be greatly appreciated.

Error: waiting for EKS Add-On (ecp-ppp-prod:coredns) create: timeout while waiting for state to become 'ACTIVE' (last state: 'DEGRADED', timeout: 20m0s)
│
│   with module.eks.aws_eks_addon.this["coredns"],
│   on .terraform/modules/eks/main.tf line 498, in resource "aws_eks_addon" "this":
│  498: resource "aws_eks_addon" "this" {
│

Here are the Terraform modules and reproduction code:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.13"

  cluster_name                   = local.name
  cluster_version                = local.cluster_version 

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets  
  control_plane_subnet_ids = module.vpc.intra_subnets

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  enable_irsa = true

  enable_cluster_creator_admin_permissions = true

  # This sets the cluster authentication mode to API and ConfigMap; EKS will automatically create an access entry for the IAM role(s) used by managed node group(s)
  authentication_mode = "API_AND_CONFIG_MAP"


  # EKS Addons
  cluster_addons = {
    coredns    = {
      most_recent = true
    }

    eks-pod-identity-agent = {
      most_recent = true
    }

    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      # Specify the VPC CNI addon should be deployed before compute to ensure
      # the addon is configured before data plane compute resources are created
      # See README for further details
      before_compute = true
      most_recent    = true # To ensure access to the latest settings provided
      configuration_values = jsonencode({
        env = {
          # Reference docs https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  } 
 

  self_managed_node_groups = {
    # AL2023 node group utilizing new user data format which utilizes nodeadm
    # to join nodes to the cluster (instead of /etc/eks/bootstrap.sh)
    al2023_nodeadm = {

      name            = "cis-self-mng"
      use_name_prefix = true

      # ebs_optimized     = true
      enable_monitoring = true

      subnet_ids = module.vpc.public_subnets

      min_size         = 1
      max_size         = 3
      desired_size     = 1

      instance_type = "m6i.large"

      enable_bootstrap_user_data = true
      is_eks_managed_node_group = false

      ami_id = data.aws_ami.image_cis_eks.id 
 

      launch_template_name            = "amazon-eks-al2023-node-1.30"
      launch_template_use_name_prefix = true 
      launch_template_description     = "amazon-eks-al2023-node-1.30"

   
      // The following variables are necessary if you decide to use the module outside of the parent EKS module context.
      // Without it, the security groups of the nodes are empty and thus won't join the cluster.
      vpc_security_group_ids = [
        module.eks.cluster_primary_security_group_id,
        module.eks.cluster_security_group_id,
      ]
 
      # AL2023 node group utilizing new user data format which utilizes nodeadm
      # to join nodes to the cluster (instead of /etc/eks/bootstrap.sh)
      cloudinit_pre_nodeadm = [
        {
          content_type = "application/node.eks.aws"
          content      = <<-EOT
            ---
            apiVersion: node.eks.aws/v1alpha1
            kind: NodeConfig
            spec:
              featureGates:
                InstanceIdNodeName: true
              cluster:
                name: ecp-ppp-prod
                apiServerEndpoint: https://aaa.us-east-1.eks.amazonaws.com
                certificateAuthority: aaa
                cidr: 1aa.bb.0.0/16
              kubelet:
                config:
                  shutdownGracePeriod: 30s
                  featureGates:
                    DisableKubeletCloudCredentialProviders: true
              containerd:
                config: |
                  [plugins."io.containerd.grpc.v1.cri".containerd]
                  discard_unpacked_layers = false
          EOT
        }
      ]
    } 
  } 

  tags = local.tags 
}

I also tried the standalone self-managed node group module, but I am getting the same issue.

module "self_managed_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/self-managed-node-group"
  version = "20.13.1"

  name                = "cis-self-mng"
  cluster_name        = "aaa-ppp-prod"
  cluster_version     = "1.30"
  cluster_endpoint    = "https://aa.gr7.us-east-1.eks.amazonaws.com"
  cluster_auth_base64 = "bb"
  cluster_ip_family    = "ipv4"
  cluster_service_cidr = "aa.bb.0.10"

  subnet_ids = module.vpc.private_subnets


  ami_id   = data.aws_ami.image_cis_eks.id

  user_data_template_path = "${path.module}/modules/user_data/templates/al2023_custom.tpl"

  cloudinit_pre_nodeadm = [{
    content      = <<-EOT
      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        kubelet:
          config:
            shutdownGracePeriod: 30s
            featureGates:
              DisableKubeletCloudCredentialProviders: true
    EOT
    content_type = "application/node.eks.aws"
  }]

  cloudinit_post_nodeadm = [{
    content      = <<-EOT
      echo "All done"
    EOT
    content_type = "text/x-shellscript; charset=\"us-ascii\""
  }]

  // The following variables are necessary if you decide to use the module outside of the parent EKS module context.
  // Without it, the security groups of the nodes are empty and thus won't join the cluster.
  vpc_security_group_ids = [
    module.eks.cluster_primary_security_group_id,
    module.eks.cluster_security_group_id,
  ]

  min_size     = 1
  max_size     = 4
  desired_size = 1

  launch_template_name   = "cis-self-mng"
  instance_type          = "m5.2xlarge"

  tags = {
    Environment = "ppp-prod"
    Terraform   = "true"
  }
}

This is my supporting VPC Terraform module:

################################################################################
# VPC supporting resources
################################################################################

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0.0"

  name = local.name
  cidr = local.vpc_cidr

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.aa.aa.0/24", "10.aa.aa.0/24", "10.aa.bb.0/24"]
  public_subnets  = ["10.aa.16.aa/26", "10.cc.dd.128/26", "10.dd.ee.1ff/26"]

  enable_nat_gateway     = true
  create_igw             = true

  single_nat_gateway     = false
  one_nat_gateway_per_az = false

  enable_dns_hostnames   = true
  enable_dns_support     = true

  enable_flow_log                      = true
  create_flow_log_cloudwatch_iam_role  = true
  create_flow_log_cloudwatch_log_group = true


  public_subnet_tags = {
    "kubernetes.io/role/elb"                        = 1
    "kubernetes.io/cluster/${var.environment_name}" = "owned"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
    # Tags subnets for Karpenter auto-discovery
    # "karpenter.sh/discovery" = local.name

    "kubernetes.io/cluster/${var.environment_name}" = "owned"

  }

  tags = local.tags

}


1 Answer

Greeting

Hi Ravindar!

Thanks for sharing the details of your issue and the Terraform configuration. It sounds like you're encountering a frustrating problem with self-managed node groups not joining your EKS cluster, leading to a "DEGRADED" CoreDNS add-on status. Let’s break this down and work toward a resolution. 😊


Clarifying the Issue

From your description, you're using Amazon Linux 2023 EKS-optimized AMIs to launch self-managed node groups. While the managed node groups work perfectly with your existing Terraform module, the self-managed node groups fail to join the cluster, causing CoreDNS to remain in a "DEGRADED" state. Additionally, you’ve shared detailed Terraform configurations and mentioned that you’ve already tried using cloudinit_pre_nodeadm. This indicates a likely misconfiguration related to bootstrap setup, IAM roles, networking, or deployment timing.

This issue impacts your ability to use self-managed node groups effectively, which are vital for controlling costs and implementing custom configurations. Let's explore the steps to resolve this! 🚀


Why This Matters

Self-managed node groups provide flexibility and cost efficiency compared to managed node groups. Resolving this issue will enable you to fully leverage self-managed nodes in your EKS cluster while ensuring critical services like CoreDNS function correctly. This is crucial for cluster stability and operational success.


Key Terms

  • CoreDNS: A DNS and service discovery solution for Kubernetes clusters.
  • IAM Role: Permissions assigned to AWS resources to access other services securely.
  • Self-Managed Node Groups: EC2 instances managed outside the default AWS-managed node group setup for EKS.
  • CloudInit: A tool for configuring EC2 instances during boot.
  • EKS Add-Ons: Pre-configured software components deployed within an EKS cluster.

The Solution (Our Recipe)

Steps at a Glance:

  1. Verify IAM Role and Permissions.
  2. Debug CloudInit and Bootstrap.
  3. Adjust Security Groups and Networking.
  4. Modify EKS Add-On Deployment Order.
  5. Pin Compatible CoreDNS Versions.
  6. Increase EKS Add-On Timeout Period.
  7. Check Node and CoreDNS Logs.

Step-by-Step Guide:

1. Verify IAM Role and Permissions:
Ensure that the IAM role attached to your self-managed node group includes the following policies:

resource "aws_iam_role_policy_attachment" "eks_node" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.eks_node_role.name
}
resource "aws_iam_role_policy_attachment" "ec2_container_registry" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks_node_role.name
}
resource "aws_iam_role_policy_attachment" "eks_cni" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.eks_node_role.name
}

Without these policies, nodes cannot fetch required configurations or communicate with the control plane.
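
The attachments above reference a node role that is not shown; here is a minimal sketch of what that role could look like (names are illustrative, not from your configuration). Note also that with authentication_mode = "API_AND_CONFIG_MAP", EKS auto-creates access entries only for managed node groups, so a self-managed node role may additionally need an explicit access entry of type EC2_LINUX before its nodes can register:

```hcl
# Minimal sketch of the node IAM role that the policy attachments above expect.
# The role name is illustrative; adapt it to your naming conventions.
resource "aws_iam_role" "eks_node_role" {
  name = "cis-self-mng-node-role"

  # Allow EC2 instances to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# With authentication_mode = "API_AND_CONFIG_MAP", EKS creates access entries
# automatically only for *managed* node groups. A self-managed node role may
# need an explicit EC2_LINUX access entry so its nodes can join the cluster.
resource "aws_eks_access_entry" "self_managed_nodes" {
  cluster_name  = module.eks.cluster_name
  principal_arn = aws_iam_role.eks_node_role.arn
  type          = "EC2_LINUX"
}
```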


2. Debug CloudInit and Bootstrap:

  • Log into one of the EC2 instances and check /var/log/cloud-init.log and /var/log/cloud-init-output.log for errors.
  • Focus on bootstrap configuration for apiServerEndpoint, certificateAuthority, and kubelet settings.

Example debugging commands:

cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log

Pro Tip: Look for errors indicating certificate mismatches or missing credentials, which often point to IAM or networking misconfigurations.


3. Adjust Security Groups and Networking:

  • Ensure security group rules allow the following:
    • Inbound/outbound traffic on port 443 for EKS control plane communication.
    • Pod-to-pod communication on ports 1025-65535.

Example Terraform configuration:

resource "aws_security_group_rule" "eks_ingress" {
  security_group_id        = module.eks.node_security_group_id
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  # Restrict the source to the control plane security group rather than 0.0.0.0/0
  source_security_group_id = module.eks.cluster_security_group_id
}
  • Verify subnet tagging:
    • "kubernetes.io/cluster/<cluster-name>" = "owned"
    • "kubernetes.io/role/internal-elb" = 1

Missing tags can prevent node registration.
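
If the VPC already exists and only the tags are missing, the subnets can be tagged in place instead of recreating the VPC. A minimal sketch, assuming the cluster name ecp-ppp-prod from your error message and the vpc module outputs used above:

```hcl
# Sketch: apply the EKS discovery tag to existing private subnets in place.
# Cluster name is assumed from the error message; adjust to your environment.
resource "aws_ec2_tag" "cluster_owned" {
  for_each    = toset(module.vpc.private_subnets)
  resource_id = each.value
  key         = "kubernetes.io/cluster/ecp-ppp-prod"
  value       = "owned"
}
```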


4. Modify EKS Add-On Deployment Order:
Ensure CoreDNS deploys after nodes are ready by using depends_on:

resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_name
  addon_name   = "coredns"

  # depends_on must reference the module as a whole, not an individual output
  depends_on = [module.eks]
}

5. Pin Compatible CoreDNS Versions:
Pin CoreDNS to a version compatible with your cluster version (verify with aws eks describe-addon-versions --addon-name coredns --kubernetes-version 1.30):

cluster_addons = {
  coredns = {
    addon_version = "v1.11.2-eksbuild.1"
  }
}

6. Increase EKS Add-On Timeout Period:
Add a timeout configuration to avoid premature failure:

resource "aws_eks_addon" "coredns" {
  # ... other arguments ...

  timeouts {
    create = "30m"
  }
}

7. Check Node and CoreDNS Logs:
Use the following commands to inspect node and pod statuses:

kubectl get nodes
kubectl get pods -n kube-system
kubectl logs <coredns-pod-name> -n kube-system
kubectl describe pod <coredns-pod-name> -n kube-system

Check for issues like:

  • "CrashLoopBackOff" or "FailedScheduling" in CoreDNS pods.
  • Nodes stuck in "NotReady" state.

Closing Thoughts

This step-by-step guide should help you identify and resolve the root cause of the "DEGRADED" CoreDNS status. For more detailed guidance, refer to the AWS documentation on EKS self-managed nodes, nodeadm, and managing EKS add-ons.


Farewell

Ravindar, I hope these steps bring your self-managed node groups and CoreDNS into a healthy state. Let me know how it goes or if you need further assistance. I'm happy to help! 😊🚀

answered a year ago
