How can I troubleshoot Amazon EKS pods on AWS Fargate that are stuck in a Pending state?


My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on AWS Fargate instances are stuck in a Pending state.

Short description

Here are some common scenarios that cause pods to remain stuck in the Pending state on Amazon EKS using AWS Fargate:

  • There's a capacity error because a particular vCPU and memory combination is unavailable.
  • You created the CoreDNS pods with a default annotation that maps them to the Amazon Elastic Compute Cloud (Amazon EC2) compute type. To schedule them on a Fargate node, remove the Amazon EC2 compute type annotation.
  • The pod didn't match any Fargate profiles when you created it and isn't assigned to the fargate-scheduler. If a pod isn't matched on creation, then it isn't automatically rescheduled to Fargate nodes. This is true even if you create a matching profile later. In this case, the pod is assigned to the default-scheduler.
  • If the pod is assigned to the fargate-scheduler but remains in a Pending state, then the pod might require additional troubleshooting.

Before troubleshooting, note the following Fargate pod rules:

  • You must configure a namespace and match labels for your pod selectors. The Fargate workflow matches pods to a Fargate profile only if both conditions match the pod specification.
  • If you specify multiple pod selectors within a single Fargate profile, then fargate-scheduler schedules the pod if it matches any of the selectors.
  • If a pod specification matches multiple Fargate profiles, then the pod is scheduled according to a randomly chosen Fargate profile. To avoid this, you can use the annotation eks.amazonaws.com/fargate-profile: fp_name within the pod specification.
    Note: Replace fp_name with your Fargate profile name.
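
To compare a pod against these rules, it can help to print the pod's namespace and labels next to the selectors of a Fargate profile. The following is a minimal sketch, not an official procedure. The function names are hypothetical, and the sketch assumes that kubectl and the AWS CLI are installed and configured for your cluster.

```shell
# Hypothetical helpers (names are illustrative, not part of any AWS tool).

# Print the namespace and labels of a pod so that they can be compared
# with the selectors of a Fargate profile.
# Arguments: <namespace> <pod-name>
show_pod_namespace_and_labels() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.metadata.namespace}{"\n"}{.metadata.labels}{"\n"}'
}

# Print the selectors (namespace plus optional labels) of a Fargate profile.
# Arguments: <cluster-name> <fargate-profile-name>
show_fargate_profile_selectors() {
  aws eks describe-fargate-profile \
    --cluster-name "$1" \
    --fargate-profile-name "$2" \
    --query 'fargateProfile.selectors' \
    --output json
}
```

For example, run show_pod_namespace_and_labels <namespace> <pod-name> and then show_fargate_profile_selectors <cluster-name> fp_name. The pod matches the profile only if its namespace and every label in at least one selector agree.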

Resolution

Important: The following steps apply only to pods launched with AWS Fargate. For information on pods launched on Amazon EC2 instances, see How can I troubleshoot the pod status in Amazon EKS?

Find out the status of your pod

1.    Run the following command to check your pod state:

kubectl get pods -n <namespace>

2.    To get more error information about your pod, run the following describe command:

kubectl describe pod YOUR_POD_NAME -n <namespace>

Refer to the output of the describe command to evaluate which of the following resolutions will help troubleshoot your issue.
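
When a namespace contains many pods, it might be easier to filter the output. The following sketch wraps two standard kubectl queries: one that lists only Pending pods, and one that lists the events recorded for a single pod. The function names are hypothetical.

```shell
# Hypothetical helpers (names are illustrative).

# List only the pods in a namespace whose phase is Pending.
# Arguments: <namespace>
list_pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending
}

# List the events for one pod, oldest first.
# Arguments: <namespace> <pod-name>
show_pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}
```

The events list carries the same Warning entries that appear at the end of the describe output, such as FailedScheduling.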

Resolving capacity error

If your pods have a capacity issue, then the describe output is similar to the following:

Fargate capacity is unavailable at this time. Please try again later or in a different availability zone

This output means that Fargate can't provision compute capacity based on the vCPU and memory combination that you selected.

To resolve the error:

  • Retry creating the pod after 15-20 minutes. Because the error is capacity-based, the exact amount of time can vary.
  • Change the request (CPU and memory) within your pod specification (from the Kubernetes website). The Fargate workflow then provisions a new combination of vCPU and memory.
    Note: You're billed based on one of the supported vCPU and memory combinations. For more information on how the combination is finalized based on your pod specification, see Pod CPU and memory. The output of a kubectl describe node command might show a much higher vCPU and memory value than the combination that you requested. This is because Fargate doesn't always have available capacity that matches your requests, and it provisions resources from a capacity pool on a best-effort basis. However, you're billed only for pod usage and the equivalent vCPU and memory combination.
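
Before you change the requests, it can be useful to see what the pod currently asks for. The following sketch prints the per-container resource requests. It also prints the CapacityProvisioned annotation, which, to the best of our knowledge, Fargate adds to a pod only after it schedules the pod, so expect it to be empty for a pod that's still Pending. The function names are hypothetical.

```shell
# Hypothetical helpers (names are illustrative).

# Print each container name with its resource requests.
# Arguments: <namespace> <pod-name>
show_pod_requests() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'
}

# Print the vCPU and memory combination recorded on the pod, if any.
# Assumption: the annotation is named CapacityProvisioned and is present
# only on pods that Fargate has already scheduled.
# Arguments: <namespace> <pod-name>
show_capacity_provisioned() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.metadata.annotations.CapacityProvisioned}{"\n"}'
}
```

Running show_capacity_provisioned against a healthy pod from the same workload shows which combination a comparable request resolved to.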

Resolving CoreDNS pods in a Pending state

If CoreDNS pods are in a Pending state, then you see an output that's similar to the following message:

kubectl get pods -n kube-system  
NAME                                     READY   STATUS     RESTARTS      AGE
coredns-6548845887-qk9vf                 0/1     Pending    0             157m

This might be because the CoreDNS deployment has the following default annotation: eks.amazonaws.com/compute-type: ec2.

To resolve this and re-assign the pods to the Fargate scheduler, see Update CoreDNS.
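
As a rough sketch of what that update involves, the following functions inspect the annotations on the CoreDNS deployment, remove the compute-type annotation with a JSON patch, and restart the deployment so that new pods are created. The function names are hypothetical. The sketch assumes that a Fargate profile that matches the CoreDNS pods in the kube-system namespace already exists. Treat the linked Update CoreDNS page as the authoritative procedure.

```shell
# Hypothetical helpers (names are illustrative).

# Print the pod template annotations of the CoreDNS deployment.
show_coredns_annotations() {
  kubectl get deployment coredns -n kube-system \
    -o jsonpath='{.spec.template.metadata.annotations}{"\n"}'
}

# Remove the eks.amazonaws.com/compute-type annotation, then restart the
# deployment only if the patch succeeded. In a JSON Pointer path, "~1"
# is the escape sequence for "/" inside a key name.
remove_coredns_compute_type() {
  kubectl patch deployment coredns -n kube-system --type json \
    -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]' &&
    kubectl rollout restart deployment coredns -n kube-system
}
```

The patch fails if the annotation isn't present, so it's worth running show_coredns_annotations first.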

Troubleshooting pods assigned to fargate-scheduler

There are multiple reasons why pods assigned to fargate-scheduler might be stuck in a Pending state, ranging from a misconfigured pod annotation to networking issues. If your pods remain in a Pending state, then the describe output is similar to the following message:

Events:
Type       Reason              Age                     From     
----       ------              ----                    ----     
Warning    FailedScheduling    2m25s (x301 over 5h3m)  fargate-scheduler

To troubleshoot this error:

  • Delete and recreate the pods.
  • Confirm that the following fields aren't set in the pod specification YAML:
    nodeSelector
    nodeName
    schedulerName
    These fields cause the fargate-scheduler to skip the pod.
  • Confirm that the subnets selected in your Fargate profile have enough free IP addresses to create new pods. Each Fargate node consumes one IP address from the subnet.
  • Confirm that the NAT gateway is in a public subnet, and has an Elastic IP address attached to it.
  • Confirm that the DHCP options sets that are associated with your virtual private cloud (VPC) have AmazonProvidedDNS or a valid DNS server for domain-name-servers.
  • Confirm that DNS hostnames and DNS resolution are turned on for your VPC.
  • If your Fargate pods use private subnets with only VPC endpoints configured for service communication, then you must allow these endpoints with DNS names:
    ECR - API
    ECR - DKR
    S3 Gateway endpoint
  • Confirm the security group that's attached to the VPC endpoint allows communication from Fargate to and from the API server. The VPC endpoint security group must allow port 443 ingress from the cluster VPC CIDR. You must also turn on private endpoint access for your cluster.

Resolving pods assigned to default-scheduler

To determine the scheduler that your pods are assigned to, run the following command:

kubectl get pods -o yaml -n <namespace> <pod-name> | grep schedulerName

In the output, confirm that the schedulerName is fargate-scheduler. If it's listed as default-scheduler, then the fargate-scheduler skipped this pod. To troubleshoot this issue, check your pod configuration for compute-type annotations and refer to AWS Fargate considerations.
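
A pod that was assigned to default-scheduler isn't moved to Fargate automatically, so after you correct the configuration you must recreate the pod. The following sketch reads the scheduler name directly and restarts a workload so that new pods are created and matched against your Fargate profiles again. The function names are hypothetical, and the restart helper assumes that a Deployment manages the pod.

```shell
# Hypothetical helpers (names are illustrative).

# Print only the scheduler name of a pod.
# Arguments: <namespace> <pod-name>
show_pod_scheduler() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.schedulerName}{"\n"}'
}

# Recreate the pods of a Deployment so that they are evaluated again
# when they are created.
# Assumption: the Pending pod belongs to a Deployment.
# Arguments: <namespace> <deployment-name>
recreate_deployment_pods() {
  kubectl rollout restart deployment "$2" -n "$1"
}
```

For a standalone pod that no controller manages, delete the pod and apply its manifest again instead.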

AWS OFFICIAL
Updated 2 months ago