Greeting
Hi Ravindar!
Thanks for sharing the details of your issue and the Terraform configuration. It sounds like you're encountering a frustrating problem with self-managed node groups not joining your EKS cluster, leading to a "DEGRADED" CoreDNS add-on status. Let’s break this down and work toward a resolution. 😊
Clarifying the Issue
From your description, you're using Amazon Linux 2023 EKS-optimized AMIs to launch self-managed node groups. While the managed node groups work perfectly with your existing Terraform module, the self-managed node groups fail to join the cluster, causing CoreDNS to remain in a "DEGRADED" state. Additionally, you’ve shared detailed Terraform configurations and mentioned that you’ve already tried using cloudinit_pre_nodeadm. This indicates a likely misconfiguration related to bootstrap setup, IAM roles, networking, or deployment timing.
This issue impacts your ability to use self-managed node groups effectively, which are vital for controlling costs and implementing custom configurations. Let's explore the steps to resolve this! 🚀
Why This Matters
Self-managed node groups provide flexibility and cost efficiency compared to managed node groups. Resolving this issue will enable you to fully leverage self-managed nodes in your EKS cluster while ensuring critical services like CoreDNS function correctly. This is crucial for cluster stability and operational success.
Key Terms
- CoreDNS: A DNS and service discovery solution for Kubernetes clusters.
- IAM Role: Permissions assigned to AWS resources to access other services securely.
- Self-Managed Node Groups: EC2 instances managed outside the default AWS-managed node group setup for EKS.
- CloudInit: A tool for configuring EC2 instances during boot.
- EKS Add-Ons: Pre-configured software components deployed within an EKS cluster.
The Solution (Our Recipe)
Steps at a Glance:
- Verify IAM Role and Permissions.
- Debug CloudInit and Bootstrap.
- Adjust Security Groups and Networking.
- Modify EKS Add-On Deployment Order.
- Pin Compatible CoreDNS Versions.
- Increase EKS Add-On Timeout Period.
- Check Node and CoreDNS Logs.
Step-by-Step Guide:
1. Verify IAM Role and Permissions:
Ensure that the IAM role attached to your self-managed node group includes the following policies:
```
resource "aws_iam_role_policy_attachment" "eks_node" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.eks_node_role.name
}

resource "aws_iam_role_policy_attachment" "ec2_container_registry" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks_node_role.name
}

resource "aws_iam_role_policy_attachment" "eks_cni" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.eks_node_role.name
}
```
Without these policies, nodes cannot fetch required configurations or communicate with the control plane.
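If the role itself is missing or its trust policy is wrong, the policy attachments have nothing to bind to. Here is a minimal sketch of the node role and instance profile, assuming the name `eks_node_role` used in the attachments above (adjust names and tags to your module's conventions):

```
# Hypothetical node role; the trust policy is the standard EC2 assume-role document.
resource "aws_iam_role" "eks_node_role" {
  name = "eks-self-managed-node-role"

  # EC2 instances must be able to assume this role via an instance profile.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Self-managed nodes attach the role through an instance profile, not directly.
resource "aws_iam_instance_profile" "eks_node" {
  name = "eks-self-managed-node-profile"
  role = aws_iam_role.eks_node_role.name
}
```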
2. Debug CloudInit and Bootstrap:
- Log into one of the EC2 instances and check `/var/log/cloud-init.log` and `/var/log/cloud-init-output.log` for errors.
- Focus on the bootstrap configuration for `apiServerEndpoint`, `certificateAuthority`, and `kubelet` settings.
Example debugging commands:
```
cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log
```
Pro Tip: Look for errors indicating certificate mismatches or missing credentials, which often point to IAM or networking misconfigurations.
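On AL2023 EKS-optimized AMIs, bootstrap is handled by `nodeadm` rather than the older `bootstrap.sh`, so the instance user data must contain a valid `NodeConfig` document. A hedged sketch of what that user data should look like, as a Terraform heredoc (all values here are placeholders; wire them to your actual cluster outputs):

```
# Hypothetical locals block; in practice populate these fields from your
# EKS module outputs (cluster name, endpoint, CA data, service CIDR).
locals {
  node_user_data = <<-EOT
    ---
    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      cluster:
        name: my-cluster                                      # placeholder
        apiServerEndpoint: https://EXAMPLE.eks.amazonaws.com  # placeholder
        certificateAuthority: LS0t...                         # placeholder: base64 CA bundle
        cidr: 172.20.0.0/16                                   # placeholder: service CIDR
  EOT
}
```

If any of these fields is missing or wrong, `nodeadm` fails during boot and the node never registers, which is exactly the failure mode the cloud-init logs above will surface.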
3. Adjust Security Groups and Networking:
- Ensure security group rules allow the following:
- Inbound/outbound traffic on port 443 for EKS control plane communication.
- Pod-to-pod communication on ports 1025-65535.
Example Terraform configuration:
```
resource "aws_security_group_rule" "eks_ingress" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.node.id  # required argument; reference your node security group here
}
```
- Verify subnet tagging:
  - `kubernetes.io/cluster/<cluster-name>` = `owned`
  - `kubernetes.io/role/internal-elb` = `1`

Missing tags can prevent node registration.
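In Terraform, these tags can be applied directly on the subnet resources. A minimal sketch, assuming a hypothetical private subnet resource and a cluster named `my-cluster` (merge the tags into your existing subnet definitions instead of creating new resources):

```
# Hypothetical subnet; only the tags block is the point here.
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id   # placeholder VPC reference
  cidr_block = "10.0.1.0/24"     # placeholder CIDR

  tags = {
    "kubernetes.io/cluster/my-cluster" = "owned"  # replace my-cluster with your cluster name
    "kubernetes.io/role/internal-elb"  = "1"
  }
}
```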
4. Modify EKS Add-On Deployment Order:
Ensure CoreDNS deploys after nodes are ready by using depends_on:
```
resource "aws_eks_addon" "coredns" {
  cluster_name = module.eks.cluster_name
  addon_name   = "coredns"

  depends_on = [module.eks.self_managed_node_group]
}
```
5. Pin Compatible CoreDNS Versions:
Specify a compatible CoreDNS version (v1.11.2 for Kubernetes 1.30):
```
cluster_addons = {
  coredns = {
    addon_version = "v1.11.2-eksbuild.1"
  }
}
```
6. Increase EKS Add-On Timeout Period:
Add a `timeouts` block inside the `aws_eks_addon` resource to avoid premature failure:
```
timeouts {
  create = "30m"
}
```
7. Check Node and CoreDNS Logs:
Use the following commands to inspect node and pod statuses:
```
kubectl get nodes
kubectl get pods -n kube-system
kubectl logs <coredns-pod-name> -n kube-system
kubectl describe pod <coredns-pod-name> -n kube-system
```
Check for issues like:
- `CrashLoopBackOff` or `FailedScheduling` in CoreDNS pods.
- Nodes stuck in the `NotReady` state.
Closing Thoughts
This step-by-step guide should help you identify and resolve the root cause of the "DEGRADED" CoreDNS status. For more detailed guidance, refer to the AWS documentation on self-managed nodes and managing the CoreDNS add-on in the Amazon EKS User Guide.
Farewell
Ravindar, I hope these steps bring your self-managed node groups and CoreDNS into a healthy state. Let me know how it goes or if you need further assistance. I'm happy to help! 😊🚀