How can I troubleshoot the pod status in Amazon EKS?

My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on Amazon Elastic Compute Cloud (Amazon EC2) instances or on a managed node group are stuck. I want to get my pods in the Running state.

Resolution

Important: The following steps apply only to pods launched on Amazon EC2 instances or a managed node group. These steps don't apply to pods launched on AWS Fargate.

Find out the status of your pod

1.    To get the status of your pod, run the following command:

$ kubectl get pod

2.    To get information from the Events history of your pod, run the following command:

$ kubectl describe pod YOUR_POD_NAME

Note: The example commands covered in the following steps are in the default namespace. For other namespaces, append -n YOURNAMESPACE to the command.

3.    Based on the status of your pod, complete the steps in one of the following sections: Your pod is in the Pending state, Your container is in the Waiting state, or Your pod is in the CrashLoopBackOff state.

Your pod is in the Pending state

Pods in the Pending state can't be scheduled onto a node. This can occur due to insufficient resources or with the use of hostPort. For more information, see Pod phase in the Kubernetes documentation.

If you have insufficient resources available on the worker nodes, then consider deleting unnecessary pods. You can also add more resources on the worker nodes. You can use the Kubernetes Cluster Autoscaler to automatically scale your worker node group when resources in your cluster are scarce.
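For example, to see how much CPU and memory each node can still allocate, you can inspect the node descriptions. These commands are a quick check rather than part of the original steps, and kubectl top requires the Kubernetes Metrics Server to be installed in the cluster:

# Show node names together with their "Allocated resources" summaries
$ kubectl describe nodes | grep -E -A 8 "^Name:|^Allocated resources:"

# If the Kubernetes Metrics Server is installed, show current CPU and memory usage per node
$ kubectl top nodes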

Insufficient CPU

$ kubectl describe pod frontend-cpu                               
Name:         frontend-cpu
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  22s (x14 over 13m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

Insufficient Memory

$ kubectl describe pod frontend-memory
Name:         frontend-memory
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  80s (x14 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient memory.

If you defined a hostPort for your pod, then follow these best practices:

  • Don't specify a hostPort unless it's necessary, because the hostIP, hostPort, and protocol combination must be unique.
  • If you specify a hostPort, then schedule the same number of pods as there are worker nodes.

Note: There is a limited number of places that a pod can be scheduled when you bind a pod to a hostPort.
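To see which hostPort values are already in use across the cluster, you can query the pod specs with a jsonpath expression similar to the following. This is a quick sketch; adjust the namespace scope as needed:

# List namespace, pod name, and any hostPort each pod requests; filter out pods without one
$ kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}' \
    | awk -F'\t' '$3 != ""'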

The following example shows the output of the describe command for frontend-port-77f67cff67-2bv7w, which is in the Pending state. The pod is unscheduled because the requested host port isn't available on any worker node in the cluster.

Port unavailable

$ kubectl describe pod frontend-port-77f67cff67-2bv7w                                            
Name:           frontend-port-77f67cff67-2bv7w
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=frontend-port
                pod-template-hash=77f67cff67
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/frontend-port-77f67cff67
Containers:
  app:
    Image:      nginx
    Port:       80/TCP
    Host Port:  80/TCP
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  11s (x7 over 6m22s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports.

If the pods can't be scheduled because the nodes have taints that the pod doesn't tolerate, then the example output is similar to the following:

$ kubectl describe pod nginx                                                  
Name:         nginx
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  8s (x10 over 9m22s)  default-scheduler  0/3 nodes are available: 3 node(s) had taint {key1: value1}, that the pod didn't tolerate.

You can check your node taints with the following command:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints             
NAME                                                TAINTS
ip-192-168-4-78.ap-southeast-2.compute.internal     [map[effect:NoSchedule key:key1 value:value1]]
ip-192-168-56-162.ap-southeast-2.compute.internal   [map[effect:NoSchedule key:key1 value:value1]]
ip-192-168-91-249.ap-southeast-2.compute.internal   [map[effect:NoSchedule key:key1 value:value1]]

If you want to retain your node taints, then you can specify a toleration for a pod in the PodSpec. For more information, see the Concepts section in the Kubernetes documentation.
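For example, a toleration that matches the key1=value1:NoSchedule taint from the preceding output might look like the following. This is a minimal sketch; the pod name and image are placeholders:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-toleration    # placeholder pod name
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:                   # allows scheduling onto nodes that carry this taint
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
EOF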

-or-

Remove the node taint by appending - to the end of the taint value:

$ kubectl taint nodes NODE_Name key1=value1:NoSchedule-

If your pods are still in the Pending state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.

Your container is in the Waiting state

A container in the Waiting state is scheduled on a worker node (for example, an EC2 instance), but can't run on that node.

Your container can be in the Waiting state because of an incorrect Docker image or an incorrect repository name. Or, your container can be in the Waiting state because the image doesn't exist or you lack permissions to pull it.

If you have the incorrect Docker image or repository name, then complete the following:

1.    Confirm that the image and repository name are correct by logging in to Docker Hub, Amazon Elastic Container Registry (Amazon ECR), or another container image repository.

2.    Compare the repository or image from the repository with the repository or image name specified in the pod specification.
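To print the exact image reference that the pod specification uses, so that you can compare it with what's in the repository, you can run a command similar to the following (the pod name is a placeholder):

# Print the image (including tag) referenced by each container in the pod
$ kubectl get pod YOUR_POD_NAME -o jsonpath='{.spec.containers[*].image}{"\n"}'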

If the image doesn't exist or you lack permissions, then complete the following:

1.    Verify that the image specified is available in the repository and that the correct permissions are configured to allow the image to be pulled.

2.    To confirm that image pull is possible and to rule out general networking and repository permission issues, manually pull the image. You must pull the image from the Amazon EKS worker nodes with Docker. For example:

$ docker pull yourImageURI:yourImageTag

3.    To verify that the image exists, check that both the image and tag are present in either Docker Hub or Amazon ECR.

Note: If you're using Amazon ECR, then verify that the repository policy allows image pull for the NodeInstanceRole. Or, verify that the AmazonEC2ContainerRegistryReadOnly policy is attached to the worker node IAM role.
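If the image is in Amazon ECR, you can also test the pull path end to end from a worker node with the AWS CLI and Docker. The following is a sketch; the Region, account ID, repository, and tag are placeholders:

# Authenticate Docker to your private Amazon ECR registry
$ aws ecr get-login-password --region YOUR_REGION | \
    docker login --username AWS --password-stdin YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com

# Pull the image to confirm that networking and repository permissions allow it
$ docker pull YOUR_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/YOUR_REPOSITORY:YOUR_TAG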

The following example shows a pod in the Pending state with the container in the Waiting state because of an image pull error:

$ kubectl describe po web-test

Name:               web-test
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time:         Wed, 22 Jul 2021 08:18:16 +0200
Labels:             app=web-test
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"web-test"},"name":"web-test","namespace":"default"},"spec":{...
                    kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 192.168.1.143
Containers:
  web-test:
    Container ID:   
    Image:          somerandomnonexistentimage
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ErrImagePull
...
Events:
  Type     Reason            Age                 From                                                 Message
  ----     ------            ----                ----                                                 -------
  Normal   Scheduled         66s                 default-scheduler                                    Successfully assigned default/web-test to ip-192-168-6-51.us-east-2.compute.internal
  Normal   Pulling           14s (x3 over 65s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Pulling image "somerandomnonexistentimage"
  Warning  Failed            14s (x3 over 55s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Failed to pull image "somerandomnonexistentimage": rpc error: code = Unknown desc = Error response from daemon: pull access denied for somerandomnonexistentimage, repository does not exist or may require 'docker login'
  Warning  Failed            14s (x3 over 55s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Error: ErrImagePull

If your containers are still in the Waiting state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.

Your pod is in the CrashLoopBackOff state

Pods stuck in CrashLoopBackOff are starting and crashing repeatedly.

If you receive the "Back-Off restarting failed container" output message, then your container probably exited soon after Kubernetes started the container.
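To read the exit code and reason from the last failed run without scanning the full describe output, you can query the container status directly (the pod name is a placeholder):

# Print the exit code and reason recorded for the last terminated container in the pod
$ kubectl get pod YOUR_POD_NAME \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'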

To look for errors in the logs of the current pod, run the following command:

$ kubectl logs YOUR_POD_NAME

To look for errors in the logs of the previous pod that crashed, run the following command:

$ kubectl logs --previous YOUR_POD_NAME

Note: For a multi-container pod, you can append the container name at the end. For example:

$ kubectl logs [-f] [-p] (POD | TYPE/NAME) [-c CONTAINER]

If the Liveness probe isn't returning a successful status, then verify that the Liveness probe is configured correctly for the application. For more information, see Configure Probes in the Kubernetes documentation.
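To review the probe that the pod is currently configured with (for example, its path, port, and initial delay), you can print the livenessProbe field from the pod spec (the pod name is a placeholder):

# Print the liveness probe configuration for each container in the pod
$ kubectl get pod YOUR_POD_NAME -o jsonpath='{.spec.containers[*].livenessProbe}{"\n"}'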

The following example shows a pod in the CrashLoopBackOff state because the application exits immediately after it starts. Note the State, Last State, Reason, Exit Code, and Restart Count, along with the Events.

$ kubectl describe pod crash-app-b9cf4587-66ftw 
Name:         crash-app-b9cf4587-66ftw
Namespace:    default
Priority:     0
Node:         ip-192-168-91-249.ap-southeast-2.compute.internal/192.168.91.249
Start Time:   Tue, 12 Oct 2021 12:24:44 +1100
Labels:       app=crash-app
              pod-template-hash=b9cf4587
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.82.93
IPs:
  IP:           192.168.82.93
Controlled By:  ReplicaSet/crash-app-b9cf4587
Containers:
  alpine:
    Container ID:   containerd://a36709d9520db92d7f6d9ee02ab80125a384fee178f003ee0b0fcfec303c2e58
    Image:          alpine
    Image ID:       docker.io/library/alpine@sha256:e1c082e3d3c45cccac829840a25941e679c25d438cc8412c2fa221cf1a824e6a
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 12:26:21 +1100
      Finished:     Tue, 12 Oct 2021 12:26:21 +1100
    Ready:          False
    Restart Count:  4
    ...
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m30s                default-scheduler  Successfully assigned default/crash-app-b9cf4587-66ftw to ip-192-168-91-249.ap-southeast-2.compute.internal
  Normal   Pulled     2m25s                kubelet            Successfully pulled image "alpine" in 5.121853269s
  Normal   Pulled     2m22s                kubelet            Successfully pulled image "alpine" in 1.894443044s
  Normal   Pulled     2m3s                 kubelet            Successfully pulled image "alpine" in 1.878057673s
  Normal   Created    97s (x4 over 2m25s)  kubelet            Created container alpine
  Normal   Started    97s (x4 over 2m25s)  kubelet            Started container alpine
  Normal   Pulled     97s                  kubelet            Successfully pulled image "alpine" in 1.872870869s
  Warning  BackOff    69s (x7 over 2m21s)  kubelet            Back-off restarting failed container
  Normal   Pulling    55s (x5 over 2m30s)  kubelet            Pulling image "alpine"
  Normal   Pulled     53s                  kubelet            Successfully pulled image "alpine" in 1.858871422s

The following example shows a liveness probe failing for the pod:

$ kubectl describe pod nginx
Name:         nginx
Namespace:    default
Priority:     0
Node:         ip-192-168-91-249.ap-southeast-2.compute.internal/192.168.91.249
Start Time:   Tue, 12 Oct 2021 13:07:55 +1100
Labels:       app=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.79.220
IPs:
  IP:  192.168.79.220
Containers:
  nginx:
    Container ID:   containerd://950740197c425fa281c205a527a11867301b8ec7a0f2a12f5f49d8687a0ee911
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:06e4235e95299b1d6d595c5ef4c41a9b12641f6683136c18394b858967cd1506
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 13:10:06 +1100
      Finished:     Tue, 12 Oct 2021 13:10:13 +1100
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8080/ delay=3s timeout=1s period=2s #success=1 #failure=3
    ...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m47s                  default-scheduler  Successfully assigned default/nginx to ip-192-168-91-249.ap-southeast-2.compute.internal
  Normal   Pulled     2m44s                  kubelet            Successfully pulled image "nginx" in 1.891238002s
  Normal   Pulled     2m35s                  kubelet            Successfully pulled image "nginx" in 1.878230117s
  Normal   Created    2m25s (x3 over 2m44s)  kubelet            Created container nginx
  Normal   Started    2m25s (x3 over 2m44s)  kubelet            Started container nginx
  Normal   Pulled     2m25s                  kubelet            Successfully pulled image "nginx" in 1.876232575s
  Warning  Unhealthy  2m17s (x9 over 2m41s)  kubelet            Liveness probe failed: Get "http://192.168.79.220:8080/": dial tcp 192.168.79.220:8080: connect: connection refused
  Normal   Killing    2m17s (x3 over 2m37s)  kubelet            Container nginx failed liveness probe, will be restarted
  Normal   Pulling    2m17s (x4 over 2m46s)  kubelet            Pulling image "nginx"

If your pods are still in the CrashLoopBackOff state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.

Additional troubleshooting

If your pod is still stuck after completing steps in the previous sections, then try the following steps:

1.    To confirm that worker nodes exist in the cluster and are in Ready status, run the following command:

$ kubectl get nodes

Example output:

NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-6-51.us-east-2.compute.internal    Ready    <none>   25d   v1.21.2-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   Ready    <none>   25d   v1.21.2-eks-5047ed

If the nodes are in NotReady status, then see How can I change the status of my nodes from NotReady or Unknown status to Ready status? If the nodes can't join the cluster, then see How can I get my worker nodes to join my Amazon EKS cluster?

2.    To check the version of the Kubernetes cluster, run the following command:

$ kubectl version --short

Example output:

Client Version: v1.21.2-eks-5047ed
Server Version: v1.21.2-eks-c0eccc

3.    To check the version of the Kubernetes worker node, run the following command:

$ kubectl get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

Example output:

NAME                                          VERSION
ip-192-168-6-51.us-east-2.compute.internal    v1.21.2-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   v1.21.2-eks-5047ed

4.    Confirm that the Kubernetes server version for the cluster matches the version of the worker nodes within an acceptable version skew (from the Kubernetes documentation). Use the output from the preceding steps 2 and 3 as the basis for this comparison.

Important: The patch versions can be different (for example, v1.21.x for the cluster vs. v1.21.y for the worker node).

If the cluster and worker node versions are incompatible, then create a new node group with eksctl (see the eksctl tab) or AWS CloudFormation (see the Self-managed nodes tab).

-or-

Create a new managed node group (Kubernetes: v1.21, platform: eks.1 and above) using a compatible Kubernetes version. Then, delete the node group with the incompatible Kubernetes version.
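For example, with eksctl you might create a replacement managed node group and delete the old one after workloads move off it. The following is a sketch only; the cluster name, node group names, instance type, and node count are placeholders:

# Create a new managed node group (it typically inherits the control plane's Kubernetes version)
$ eksctl create nodegroup --cluster YOUR_CLUSTER_NAME --name NEW_NODEGROUP --managed \
    --node-type t3.medium --nodes 3

# After the pods are rescheduled, delete the node group that runs the incompatible version
$ eksctl delete nodegroup --cluster YOUR_CLUSTER_NAME --name OLD_NODEGROUP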

5.    Confirm that the Kubernetes control plane can communicate with the worker nodes by verifying firewall rules against recommended rules in Amazon EKS security group considerations. Then, verify that the nodes are in Ready status.
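To find the cluster security group that Amazon EKS uses for control plane and node communication, and then review its rules against the recommendations, you can use commands similar to the following (the cluster name, Region, and security group ID are placeholders):

# Look up the cluster security group ID for the cluster
$ aws eks describe-cluster --name YOUR_CLUSTER_NAME --region YOUR_REGION \
    --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text

# Review the inbound and outbound rules of that security group
$ aws ec2 describe-security-groups --group-ids sg-EXAMPLE1234567890 --region YOUR_REGION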

