How can I troubleshoot the pod status in Amazon EKS?
My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on Amazon Elastic Compute Cloud (Amazon EC2) instances or on a managed node group are stuck. I want to get my pods in the Running state.
Resolution
Important: The following steps apply only to pods launched on Amazon EC2 instances or a managed node group. These steps don't apply to pods launched on AWS Fargate.
Find out the status of your pod
1. To get the status of your pod, run the following command:
$ kubectl get pod
2. To get information from the Events history of your pod, run the following command:
$ kubectl describe pod YOUR_POD_NAME
Note: The example commands covered in the following steps are in the default namespace. For other namespaces, append -n YOURNAMESPACE to the command.
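For example, to describe a pod in a namespace other than default (YOURNAMESPACE is a placeholder for your namespace name), run the following:
$ kubectl describe pod YOUR_POD_NAME -n YOURNAMESPACE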
3. Based on the status of your pod, complete the steps in one of the following sections: Your pod is in the Pending state, Your container is in the Waiting state, or Your pod is in the CrashLoopBackOff state.
Your pod is in the Pending state
Pods in the Pending state can't be scheduled onto a node. This can occur due to insufficient resources or with the use of hostPort. For more information, see Pod phase in the Kubernetes documentation.
If you have insufficient resources available on the worker nodes, then consider deleting unnecessary pods. You can also add more resources on the worker nodes. You can use the Kubernetes Cluster Autoscaler to automatically scale your worker node group when resources in your cluster are scarce.
Insufficient CPU
$ kubectl describe pod frontend-cpu
Name:         frontend-cpu
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  22s (x14 over 13m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
Insufficient Memory
$ kubectl describe pod frontend-memory
Name:         frontend-memory
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  80s (x14 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient memory.
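To compare what your pod requests with what each worker node has already allocated, describe the node and review the Allocated resources section. The following command is a sketch; NODE_NAME is a placeholder for one of your worker node names:
$ kubectl describe node NODE_NAME
If no node can satisfy the pod's requests, you can lower the requests in the pod specification. The following snippet is a sketch with example values only; adjust the values to what your application actually needs:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"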
If you defined a hostPort for your pod, then follow these best practices:
- Don't specify a hostPort unless it's necessary, because the hostIP, hostPort, and protocol combination must be unique.
- If you specify a hostPort, then schedule the same number of pods as there are worker nodes.
Note: There is a limited number of places that a pod can be scheduled when you bind a pod to a hostPort.
The following example shows the output of the describe command for frontend-port-77f67cff67-2bv7w, which is in the Pending state. The pod is unscheduled because the requested host port isn't available for worker nodes in the cluster.
Port unavailable
$ kubectl describe pod frontend-port-77f67cff67-2bv7w
Name:           frontend-port-77f67cff67-2bv7w
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=frontend-port
                pod-template-hash=77f67cff67
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/frontend-port-77f67cff67
Containers:
  app:
    Image:      nginx
    Port:       80/TCP
    Host Port:  80/TCP
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  11s (x7 over 6m22s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports.
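The hostPort that causes this conflict is set in the ports section of the container specification. The following snippet is a sketch that matches the preceding example; unless the workload truly requires a host port, remove the hostPort field or expose the application through a Service instead:
containers:
  - name: app
    image: nginx
    ports:
      - containerPort: 80
        hostPort: 80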
If the pods can't be scheduled because the nodes have taints that the pod doesn't tolerate, then the output is similar to the following example:
$ kubectl describe pod nginx
Name:         nginx
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  8s (x10 over 9m22s)  default-scheduler  0/3 nodes are available: 3 node(s) had taint {key1: value1}, that the pod didn't tolerate.
You can check your node taints with the following command:
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME                                                TAINTS
ip-192-168-4-78.ap-southeast-2.compute.internal     [map[effect:NoSchedule key:key1 value:value1]]
ip-192-168-56-162.ap-southeast-2.compute.internal   [map[effect:NoSchedule key:key1 value:value1]]
ip-192-168-91-249.ap-southeast-2.compute.internal   [map[effect:NoSchedule key:key1 value:value1]]
If you want to retain your node taints, then you can specify a toleration for a pod in the PodSpec. For more information, see the Concepts section in the Kubernetes documentation.
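For example, a toleration that matches the key1=value1:NoSchedule taint shown in the preceding output might look like the following sketch in the pod specification:
tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"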
-or-
Remove the node taint by appending a hyphen (-) to the end of the taint value:
$ kubectl taint nodes NODE_NAME key1=value1:NoSchedule-
If your pods are still in the Pending state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.
Your container is in the Waiting state
A container in the Waiting state is scheduled on a worker node (for example, an EC2 instance), but can't run on that node.
Your container can be in the Waiting state because of an incorrect Docker image or incorrect repository name. Or, your container can be in the Waiting state because the image doesn't exist or you lack permissions to pull it.
If you have the incorrect Docker image or repository name, then complete the following:
1. Confirm that the image and repository name are correct by logging in to Docker Hub, Amazon Elastic Container Registry (Amazon ECR), or another container image repository.
2. Compare the repository or image from the repository with the repository or image name specified in the pod specification.
If the image doesn't exist or you lack permissions, then complete the following:
1. Verify that the image specified is available in the repository and that the correct permissions are configured to allow the image to be pulled.
2. To confirm that image pull is possible and to rule out general networking and repository permission issues, manually pull the image. You must pull the image from the Amazon EKS worker nodes with Docker. For example:
$ docker pull yourImageURI:yourImageTag
3. To verify that the image exists, check that both the image and tag are present in either Docker Hub or Amazon ECR.
Note: If you're using Amazon ECR, then verify that the repository policy allows image pull for the NodeInstanceRole. Or, verify that the AmazonEC2ContainerRegistryReadOnly policy is attached to the worker node instance role.
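If the manual pull from Amazon ECR fails with an authentication error, you can authenticate Docker to the registry and then retry the pull. The following commands are a sketch; REGION and ACCOUNT_ID are placeholders for your AWS Region and account ID:
$ aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com
$ docker pull ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/yourRepository:yourImageTag
To check which policies are attached to the node instance role, you can list them with the AWS CLI. NODE_INSTANCE_ROLE_NAME is a placeholder for your worker node IAM role name:
$ aws iam list-attached-role-policies --role-name NODE_INSTANCE_ROLE_NAME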
The following example shows a pod in the Pending state with the container in the Waiting state because of an image pull error:
$ kubectl describe po web-test
Name:               web-test
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time:         Wed, 22 Jul 2021 08:18:16 +0200
Labels:             app=web-test
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"web-test"},"name":"web-test","namespace":"default"},"spec":{...
                    kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 192.168.1.143
Containers:
  web-test:
    Container ID:
    Image:          somerandomnonexistentimage
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ErrImagePull
...
Events:
  Type     Reason     Age                From                                                 Message
  ----     ------     ----               ----                                                 -------
  Normal   Scheduled  66s                default-scheduler                                    Successfully assigned default/web-test to ip-192-168-6-51.us-east-2.compute.internal
  Normal   Pulling    14s (x3 over 65s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Pulling image "somerandomnonexistentimage"
  Warning  Failed     14s (x3 over 55s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Failed to pull image "somerandomnonexistentimage": rpc error: code = Unknown desc = Error response from daemon: pull access denied for somerandomnonexistentimage, repository does not exist or may require 'docker login'
  Warning  Failed     14s (x3 over 55s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Error: ErrImagePull
If your containers are still in the Waiting state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.
Your pod is in the CrashLoopBackOff state
Pods stuck in CrashLoopBackOff are starting and crashing repeatedly.
If you receive the "Back-Off restarting failed container" output message, then your container probably exited soon after Kubernetes started the container.
To look for errors in the logs of the current pod, run the following command:
$ kubectl logs YOUR_POD_NAME
To look for errors in the logs of the previous pod that crashed, run the following command:
$ kubectl logs --previous YOUR_POD_NAME
Note: For a multi-container pod, you can append the container name at the end. For example:
$ kubectl logs [-f] [-p] (POD | TYPE/NAME) [-c CONTAINER]
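For instance, to get the logs from the previous instance of a container named app (a hypothetical container name) in your pod, run the following:
$ kubectl logs --previous YOUR_POD_NAME -c app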
If the Liveness probe isn't returning a successful status, then verify that the Liveness probe is configured correctly for the application. For more information, see Configure Probes in the Kubernetes documentation.
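A correctly configured HTTP liveness probe targets a path and port that the application actually serves. The following snippet is a minimal sketch; the path, port, and timing values are examples that you must adapt to your application:
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3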
The following example shows a pod in a CrashLoopBackOff state because the application exits right after starting. Notice the State, Last State, Reason, Exit Code, and Restart Count fields, along with the Events.
$ kubectl describe pod crash-app-b9cf4587-66ftw
Name:         crash-app-b9cf4587-66ftw
Namespace:    default
Priority:     0
Node:         ip-192-168-91-249.ap-southeast-2.compute.internal/192.168.91.249
Start Time:   Tue, 12 Oct 2021 12:24:44 +1100
Labels:       app=crash-app
              pod-template-hash=b9cf4587
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.82.93
IPs:
  IP:           192.168.82.93
Controlled By:  ReplicaSet/crash-app-b9cf4587
Containers:
  alpine:
    Container ID:   containerd://a36709d9520db92d7f6d9ee02ab80125a384fee178f003ee0b0fcfec303c2e58
    Image:          alpine
    Image ID:       docker.io/library/alpine@sha256:e1c082e3d3c45cccac829840a25941e679c25d438cc8412c2fa221cf1a824e6a
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 12:26:21 +1100
      Finished:     Tue, 12 Oct 2021 12:26:21 +1100
    Ready:          False
    Restart Count:  4
...
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m30s                default-scheduler  Successfully assigned default/crash-app-b9cf4587-66ftw to ip-192-168-91-249.ap-southeast-2.compute.internal
  Normal   Pulled     2m25s                kubelet            Successfully pulled image "alpine" in 5.121853269s
  Normal   Pulled     2m22s                kubelet            Successfully pulled image "alpine" in 1.894443044s
  Normal   Pulled     2m3s                 kubelet            Successfully pulled image "alpine" in 1.878057673s
  Normal   Created    97s (x4 over 2m25s)  kubelet            Created container alpine
  Normal   Started    97s (x4 over 2m25s)  kubelet            Started container alpine
  Normal   Pulled     97s                  kubelet            Successfully pulled image "alpine" in 1.872870869s
  Warning  BackOff    69s (x7 over 2m21s)  kubelet            Back-off restarting failed container
  Normal   Pulling    55s (x5 over 2m30s)  kubelet            Pulling image "alpine"
  Normal   Pulled     53s                  kubelet            Successfully pulled image "alpine" in 1.858871422s
The following example shows a pod whose liveness probe fails:
$ kubectl describe pod nginx
Name:         nginx
Namespace:    default
Priority:     0
Node:         ip-192-168-91-249.ap-southeast-2.compute.internal/192.168.91.249
Start Time:   Tue, 12 Oct 2021 13:07:55 +1100
Labels:       app=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.79.220
IPs:
  IP:  192.168.79.220
Containers:
  nginx:
    Container ID:   containerd://950740197c425fa281c205a527a11867301b8ec7a0f2a12f5f49d8687a0ee911
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:06e4235e95299b1d6d595c5ef4c41a9b12641f6683136c18394b858967cd1506
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 13:10:06 +1100
      Finished:     Tue, 12 Oct 2021 13:10:13 +1100
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8080/ delay=3s timeout=1s period=2s #success=1 #failure=3
...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m47s                  default-scheduler  Successfully assigned default/nginx to ip-192-168-91-249.ap-southeast-2.compute.internal
  Normal   Pulled     2m44s                  kubelet            Successfully pulled image "nginx" in 1.891238002s
  Normal   Pulled     2m35s                  kubelet            Successfully pulled image "nginx" in 1.878230117s
  Normal   Created    2m25s (x3 over 2m44s)  kubelet            Created container nginx
  Normal   Started    2m25s (x3 over 2m44s)  kubelet            Started container nginx
  Normal   Pulled     2m25s                  kubelet            Successfully pulled image "nginx" in 1.876232575s
  Warning  Unhealthy  2m17s (x9 over 2m41s)  kubelet            Liveness probe failed: Get "http://192.168.79.220:8080/": dial tcp 192.168.79.220:8080: connect: connection refused
  Normal   Killing    2m17s (x3 over 2m37s)  kubelet            Container nginx failed liveness probe, will be restarted
  Normal   Pulling    2m17s (x4 over 2m46s)  kubelet            Pulling image "nginx"
If your pods are still in the CrashLoopBackOff state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.
Additional troubleshooting
If your pod is still stuck after completing steps in the previous sections, then try the following steps:
1. To confirm that worker nodes exist in the cluster and are in Ready status, run the following command:
$ kubectl get nodes
Example output:
NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-6-51.us-east-2.compute.internal    Ready    <none>   25d   v1.21.2-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   Ready    <none>   25d   v1.21.2-eks-5047ed
If the nodes are in NotReady status, then see How can I change the status of my nodes from NotReady or Unknown status to Ready status? If the nodes can't join the cluster, then see How can I get my worker nodes to join my Amazon EKS cluster?
2. To check the version of the Kubernetes cluster, run the following command:
$ kubectl version --short
Example output:
Client Version: v1.21.2-eks-5047ed
Server Version: v1.21.2-eks-c0eccc
3. To check the version of the Kubernetes worker node, run the following command:
$ kubectl get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
Example output:
NAME                                          VERSION
ip-192-168-6-51.us-east-2.compute.internal    v1.21.2-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   v1.21.2-eks-5047ed
4. Confirm that the Kubernetes server version for the cluster matches the version of the worker nodes within an acceptable version skew (from the Kubernetes documentation). Use the output from the preceding steps 2 and 3 as the basis for this comparison.
Important: The patch versions can be different (for example, v1.21.x for the cluster vs. v1.21.y for the worker node).
If the cluster and worker node versions are incompatible, then create a new node group with eksctl (see the eksctl tab) or AWS CloudFormation (see the Self-managed nodes tab).
-or-
Create a new managed node group (Kubernetes: v1.21, platform: eks.1 and above) using a compatible Kubernetes version. Then, delete the node group with the incompatible Kubernetes version.
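For example, you might create the new managed node group with eksctl similar to the following sketch; the cluster name, node group name, and instance settings are placeholders that you must replace with your own values:
$ eksctl create nodegroup --cluster YOUR_CLUSTER_NAME --name new-nodegroup --node-type t3.medium --nodes 3 --managed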
5. Confirm that the Kubernetes control plane can communicate with the worker nodes by verifying firewall rules against recommended rules in Amazon EKS security group considerations. Then, verify that the nodes are in Ready status.
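To find the cluster security group that the control plane uses to communicate with the worker nodes, you can query the cluster with the AWS CLI. This is a sketch; YOUR_CLUSTER_NAME is a placeholder:
$ aws eks describe-cluster --name YOUR_CLUSTER_NAME --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text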