EKS Auto Mode starts with 0 nodes & can't schedule new nodes

0

When I create an EKS Auto Mode cluster, I get a cluster with no nodes & no pods. Then when I try to apply a simple YAML, the deployment fails with pods stuck in the Pending state with FailedScheduling events.

Here are the commands I've run:

eksctl create cluster --enable-auto-mode=True  --name=my-name --region=my-region  --vpc-nat-mode=Disable --with-oidc=True
kubectl apply -f kubernetes_scale.yaml

And kubernetes_scale.yaml is very simple as well:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: "kubernetes-scaleup"
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  selector:
    matchLabels:
      name: "kubernetes-scaleup"
  template:
    metadata:
      labels:
        name: "kubernetes-scaleup"
    spec:
      containers:
      - image: k8s.gcr.io/pause:3.1
        name: "kubernetes-scaleup"
        resources:
          requests:
            cpu: "250m"
            memory: "250M"
            ephemeral-storage: "10Mi"
          limits:
            cpu: "250m"
            memory: "250M"
            ephemeral-storage: "10Mi"
      terminationGracePeriodSeconds: 1
      # Add not-ready/unreachable tolerations for X seconds so that node
      # failure doesn't trigger pod deletion.
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 600
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 600

This works on GKE & AKS with just the create command & setting a max-nodes variable. I thought EKS Auto Mode was supposed to be the easy way to get this working quickly?

My nodeclaims (which I got with kubectl get nodeclaims -> kubectl describe nodeclaim generalname) have status UNKNOWN & events like:

  Conditions:
    Last Transition Time:  2025-03-06T23:05:15Z
    Message:               object is awaiting reconciliation
    Observed Generation:   1
    Reason:                AwaitingReconciliation
    Status:                Unknown
    Type:                  Initialized
    Last Transition Time:  2025-03-06T23:05:15Z
    Message:               Node not registered with cluster
    Observed Generation:   1
    Reason:                NodeNotFound
    Status:                Unknown
    Type:                  Registered
    Last Transition Time:  2025-03-06T23:05:18Z
    Message:               
    Observed Generation:   1
    Reason:                Launched
    Status:                True
    Type:                  Launched
    Last Transition Time:  2025-03-06T23:05:15Z
    Message:               Initialized=Unknown, Registered=Unknown
    Observed Generation:   1
    Reason:                ReconcilingDependents
    Status:                Unknown
    Type:                  Ready
asked a year ago · 1.7k views
2 answers
1
Accepted Answer

Hello,

When you create an EKS Auto Mode cluster using the eksctl command you provided, the following occurs:

  1. A new VPC is created with public and private subnets.
  2. No NAT Gateway is set up due to the --vpc-nat-mode=Disable flag.
  3. An EKS cluster is created with public endpoint access.
  4. The cluster is associated with the private subnets.

When **only the public endpoint access** for the cluster is enabled, Kubernetes API requests that originate from within your cluster’s VPC (such as worker node to control plane communication) leave the VPC, but not Amazon’s network. In order for nodes to connect to the control plane, the worker node subnets **must have a route to an internet gateway or a route to a NAT gateway**, where they can use the public IP address of the NAT gateway.

Because the worker nodes are unable to establish communication with the EKS control plane, they cannot join the cluster: the worker nodes are launched, but never register.



Therefore, to resolve this issue, create a NAT gateway in a public subnet and add a default route (0.0.0.0/0) to it in the private subnets' route tables.
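If you are doing this by hand with the AWS CLI, the flow looks roughly like the sketch below; the subnet and route-table IDs are placeholders you would replace with your own.

```shell
# Hedged sketch: create a NAT gateway in a *public* subnet and point the
# private subnets' default route at it. IDs below are placeholders.
EIP_ALLOC=$(aws ec2 allocate-address --domain vpc \
  --query AllocationId --output text)
NAT_ID=$(aws ec2 create-nat-gateway --subnet-id subnet-PUBLIC \
  --allocation-id "$EIP_ALLOC" \
  --query NatGateway.NatGatewayId --output text)
# NAT gateways take a minute or two to become available.
aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_ID"
aws ec2 create-route --route-table-id rtb-PRIVATE \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$NAT_ID"
```

Note that the NAT gateway itself lives in a public subnet; only the route in the private route tables points at it.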

Alternatively, if you don't want to use a NAT Gateway, you can modify the cluster endpoint access to either "Public & Private" or Private only. When only the private endpoint is enabled, all traffic to your cluster’s API server must come from within your cluster’s VPC or a connected network; there is no public access to your API server from the internet, so any kubectl commands must also come from within the VPC or a connected network. For a fully private cluster, VPC endpoints such as ec2, ecr.api, ecr.dkr, s3, & sts are required.
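With eksctl, switching an existing cluster to public & private endpoint access is a single command (sketch; the cluster name is a placeholder):

```shell
# Enable both public and private API endpoint access on an existing cluster.
eksctl utils update-cluster-vpc-config \
  --cluster=my-cluster \
  --private-access=true \
  --public-access=true \
  --approve
```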

Reference: [1] https://repost.aws/knowledge-center/eks-worker-nodes-cluster

AWS
answered a year ago
EXPERT
verified 10 months ago
  • I can run kubectl commands like kubectl get nodeclaim and kubectl describe nodeclaim nodeclaimname just fine, though; that's how I got the above events. Does this still apply to the "nodes can't scale up" problem?

  • I reread your post & realized my comment misunderstood you. With only public access enabled (what I was using), I should be able to reach the cluster with kubectl, but the nodes/pods themselves can't reach the control plane as they need to. I gave my cluster public & private access with eksctl utils update-cluster-vpc-config --cluster=<clustername> --private-access=true --public-access=true --approve, & now my scaleup is still not succeeding, but I believe I am getting a new error. Specifically, my nodeclaims now have READY=TRUE rather than UNKNOWN & I have some ready nodes. My pods now have status ImagePullBackOff rather than whatever it was before, so they're still not Ready, but that's the next problem. Thanks!

-1

The issue you're experiencing with EKS Auto Mode is likely related to a few factors:

  1. Node Scheduling: EKS Auto Mode starts with zero nodes by default and scales up as needed. When you apply your deployment, the system needs time to recognize the resource requirements and provision the necessary nodes.

  2. VPC NAT Mode: You've disabled NAT mode in your cluster creation command (--vpc-nat-mode=Disable). This could prevent nodes from properly communicating with the control plane and joining the cluster.

  3. NodePool Configuration: EKS Auto Mode requires a properly configured NodePool to manage node creation. Ensure you have a NodePool defined that matches your deployment's requirements.
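As a quick check on the third point: clusters created with the built-in Auto Mode configuration normally ship with general-purpose and system NodePools (names assumed from the EKS Auto Mode defaults), which you can verify as a sketch:

```shell
# List NodePools; with the built-in config you should see
# "general-purpose" and "system" (assumed default names).
kubectl get nodepools
kubectl describe nodepool general-purpose
```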

To resolve this:

  1. Enable NAT mode when creating your cluster. Remove the --vpc-nat-mode=Disable flag from your eksctl command.

  2. After cluster creation, check if a default NodePool was created. If not, create one that matches your deployment's resource requirements.

  3. In your deployment YAML, add a nodeSelector to ensure pods are scheduled on Auto Mode nodes:

    spec:
      template:
        spec:
          nodeSelector:
            eks.amazonaws.com/compute-type: ec2-auto
  4. Allow more time for node provisioning. EKS Auto Mode may take a few minutes to recognize the need for nodes and create them.

  5. Check the status of your NodePool and any NodeClaims:

    kubectl get nodepools
    kubectl get nodeclaims
    
  6. Review the events in your cluster for any errors:

    kubectl get events --sort-by=.metadata.creationTimestamp
    

If problems persist, ensure your IAM roles and policies are correctly set up for EKS Auto Mode, and that there are no issues with your VPC or subnet configurations.

Remember, EKS Auto Mode is designed for ease of use, but it still requires proper configuration and may have a slight delay in initial node provisioning compared to pre-provisioned node groups.

Sources
Troubleshoot EKS Auto Mode - Amazon EKS
How to debug Auto-Mode custom NodePool | AWS re:Post

answered a year ago
