How do I troubleshoot Kubernetes Pod issues in Amazon EKS?

The Kubernetes Pods in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster fail. I want to identify the root cause of the Pod failure.

Resolution

Identify the error that causes your Pod issue

  1. To get information about your Pods, run the following kubectl describe command:

    kubectl describe pod YOUR_POD_NAME -n YOUR_NAMESPACE

    Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.

  2. Identify the error message in the Events section of the command's output.

    Example output:

    Events:
      Type     Reason            Age                From               Message
      ----     ------            ----               ----               -------
      Warning  FailedScheduling  24s                default-scheduler  no nodes available to schedule pods
      Warning  FailedScheduling  19s (x2 over 22s)  default-scheduler  no nodes available to schedule pods
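
If you're not sure which Pod to describe first, you can also list recent events across the namespace. The following commands are a minimal sketch that uses the same placeholder namespace:

# List recent events in the namespace, sorted by time.
kubectl get events -n YOUR_NAMESPACE --sort-by='.lastTimestamp'

# Show only Warning events to narrow the output.
kubectl get events -n YOUR_NAMESPACE --field-selector type=Warning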

Based on the error message that you receive, use the troubleshooting steps in the relevant section that follows to resolve your Pod issue.

EBS volume mounting issues

The following example output is from a kubectl describe pod ebs-pod command. The output shows a volume node affinity error for Amazon Elastic Block Store (Amazon EBS) volume mounting:

Name:         ebs-pod
...
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  88s (x20 over 96m)  default-scheduler  0/2 nodes are available: 2 node(s) had volume node affinity conflict.

The preceding error occurs when the Amazon EBS volumes behind your Pod's Persistent Volume Claims (PVCs) aren't in the same Availability Zone as a node that can run the Pod, for example when the Pod's PVCs span multiple Availability Zones. Because an EBS volume can attach only to nodes in its own Availability Zone, the scheduler can't place the Pod. To resolve this issue, make sure that the Pod's PVCs are in one Availability Zone where the Pod can be scheduled.

To troubleshoot the preceding error, complete the following steps:

  1. To get information about all the PVCs in your namespace, run the following kubectl get pvc command:

    kubectl get pvc -n YOUR_NAMESPACE

    Note: Replace YOUR_NAMESPACE with your namespace.

  2. To get information about your Persistent Volume (PV), run the following kubectl get pv command:

    kubectl get pv
  3. To find the PV that corresponds to your PVC, run the following kubectl describe pv command:

    kubectl describe pv your_PV

    Note: Replace your_PV with your PV name.

  4. Confirm that the volume ID that you receive from the preceding command is associated with the correct Availability Zone.

  5. Check the Availability Zone where the node is located, and confirm that it matches the Availability Zone of the EBS volume (see the example commands after this list).
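
To compare the Availability Zones, you can inspect the zone constraint that the CSI driver records on the PV and the zone labels on your nodes. The following commands are a minimal sketch; the volume ID is a placeholder:

# Show the node affinity (zone) constraint on the PV.
kubectl get pv your_PV -o jsonpath='{.spec.nodeAffinity}{"\n"}'

# Show the Availability Zone label of each node.
kubectl get nodes -L topology.kubernetes.io/zone

# Optionally, confirm the zone of the underlying EBS volume with the AWS CLI.
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].AvailabilityZone' --output text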

If you get a volume node affinity conflict, then take one of the following actions:

  • Use taints and tolerations to make sure that Pods that mount Amazon EBS volumes are scheduled on a node in the same Availability Zone as the EBS volume (see the sketch after this list). For more information, see Taints and tolerations on the Kubernetes website.
  • Use StatefulSets instead of a Deployment to create a unique EBS volume in the same Availability Zone for each Pod in the StatefulSet. For more information, see StatefulSets on the Kubernetes website.
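
Taints and tolerations control which tainted nodes a Pod is allowed to run on. To pin the Pod to the volume's Availability Zone, a node selector on the standard topology.kubernetes.io/zone label is a common complement. The following is a minimal sketch, not the only approach; the zone, PVC name, and container image are assumptions:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ebs-pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a    # zone of the EBS volume (assumption)
  containers:
    - name: app
      image: public.ecr.aws/amazonlinux/amazonlinux:2023    # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: your-ebs-pvc    # replace with your PVC name
EOF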

Your Pod or StatefulSet might be stuck in the Pending state even when it's in the same Availability Zone as the EBS volume. To resolve this issue, run the following kubectl logs command to check the logs of the Amazon EBS CSI driver controller Pod:

kubectl logs your-ebs-csi-controller -n your-kube-system -c your-csi-provisioner

Note: Replace your-ebs-csi-controller with the name of your Amazon EBS CSI controller Pod, your-kube-system with the namespace where the driver runs (usually kube-system), and your-csi-provisioner with the name of the container to pull logs from (usually csi-provisioner).
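
If you don't know the controller Pod or container names, you can list them first. The following commands are a minimal sketch that assumes the default labels from the Amazon EKS add-on or Helm chart installation of the driver:

# Find the EBS CSI controller Pods (the label is an assumption based on the default installation).
kubectl get pods -n kube-system -l app=ebs-csi-controller

# List the container names in the first controller Pod so that you can choose which logs to pull.
kubectl get pods -n kube-system -l app=ebs-csi-controller \
  -o jsonpath='{.items[0].spec.containers[*].name}{"\n"}'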

ContainerCreating state error

The following error message occurs when your Pod is stuck in the ContainerCreating state and the CNI network plugin fails to assign an IP address to the Pod:

"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0fdf25254b1888afeda8bf89bc1dcb093d0661ae2c8c65a4736e473c73714c65" network for pod "test": networkPlugin cni failed to set up pod "test" network: add cmd: failed to assign an IP address to container."

To troubleshoot the ContainerCreating state error, take the following actions:

  • Check whether your subnet has an available IP address to resolve the issue. Open the Amazon Virtual Private Cloud (Amazon VPC) console. In the navigation pane, under Virtual private cloud, choose Subnets.
  • Verify that the aws-node Pods are in the Running state (see the example commands after this list). Also, make sure that you use the latest supported version of the Amazon VPC CNI plugin.
  • Check whether the number of Pods on the instance reached the maximum number of Pods.
  • On the node where you scheduled your Pod, look for error messages in the ipamd and CNI plugin logs under the /var/log/aws-routed-eni path.
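
The following commands show one way to perform the first three checks in the preceding list. They're a minimal sketch; the node name and subnet ID are placeholders:

# Check that the aws-node (Amazon VPC CNI) Pods are in the Running state.
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# Compare the number of Pods on the node with the node's Pod capacity.
kubectl get pods --all-namespaces --field-selector spec.nodeName=YOUR_NODE_NAME --no-headers | wc -l
kubectl get node YOUR_NODE_NAME -o jsonpath='{.status.allocatable.pods}{"\n"}'

# Check how many free IP addresses remain in the subnet.
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].AvailableIpAddressCount' --output text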

CrashLoopBackOff state error

You receive the "Back-Off restarting failed container" error message.

The preceding error message occurs when a container repeatedly fails to start, enters the CrashLoopBackOff state, and continually restarts within the Pod.

The following issues can cause the container to repeatedly fail to start:

  • Insufficient memory
  • Resource overload
  • Deployment errors
  • External dependency issues such as DNS errors
  • Third-party dependencies
  • Container-level failures caused by port conflicts

To get the errors in the logs of the current Pod, run the following kubectl logs command:

kubectl logs YOUR_POD_NAME -n YOUR_NAMESPACE

Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.

To get the errors in the logs of the previous instance of the container that crashed, run the following kubectl logs --previous command:

kubectl logs --previous YOUR_POD_NAME -n YOUR_NAMESPACE

Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.
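
To see why the container last exited, for example OOMKilled for insufficient memory or a nonzero exit code for an application error, you can also inspect the container status. The following command is a minimal sketch that uses the same placeholders:

# Show the reason and exit code from the container's last termination.
kubectl get pod YOUR_POD_NAME -n YOUR_NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'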

Probe failure errors

When your Pod's liveness or readiness probes fail, you get a probe failure error because of either a refused connection or a client timeout.

Troubleshoot a refused connection

If a probe failed because of a refused connection, then you might get one of the following error messages:

  • "Liveness probe failed: Get https://$POD_IP:8080/<healthcheck_path>: dial tcp POD_IP:8080: connect: connection refused."
  • "Readiness probe failed: Get https://$POD_IP:8080/<healthcheck_path>: dial tcp POD_IP:8080: connect: connection refused."

To troubleshoot a refused connection, complete the following steps:

  1. To manually get the health check path that's defined on the Pod manifest from the worker node, run the following command:

    [ec2-user@ip-10-5-1-12 ~]$ curl -ikv podIP:8080/your_healthcheck_path

    Note: Replace podIP with your Pod's IP address and your_healthcheck_path with your path name.

  2. Check the health check path that's defined on the Pod manifest for the Pod that failed the liveness probe or readiness probe. To check the health check path, run the following command:

    local@bastion-host ~ % kubectl exec YOUR_POD_NAME -- curl -ikv "http://localhost:8080/your_healthcheck_path"

    Note: Replace YOUR_POD_NAME with your Pod name.

  3. Run the same container image on the bastion host.

  4. Check whether you can get the health check path that's defined on the probes in the manifest. Then, check the container logs for failures, timeouts, or errors.

  5. To check for errors in the kubelet logs of the worker node where your Pod runs, run the following journalctl command:

    [ec2-user@ip-10-5-1-12 ~]$ journalctl -u kubelet    # optionally pipe the output to 'grep' with your Pod name

Troubleshoot a client timeout

If a probe failed because of a client timeout, then you might get one of the following error messages:

  • "Liveness probe failed: Get "http://podIP:8080/<healthcheck_path> ": context deadline exceeded (Client.Timeout exceeded while awaiting headers)."
  • "Readiness probe failed: Get "http://podIP:8080/<healthcheck_path> ": context deadline exceeded (Client.Timeout exceeded while awaiting headers)."

To troubleshoot the client timeout, check whether the liveness probe and readiness probe configuration is correct for your application Pods.
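
One way to review the probe configuration is to print it directly from the Pod spec and compare the path, port, timeoutSeconds, and periodSeconds values with how long your application actually takes to respond. The following command is a minimal sketch that uses the same placeholders:

# Print the liveness and readiness probe configuration for each container in the Pod.
kubectl get pod YOUR_POD_NAME -n YOUR_NAMESPACE \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{.livenessProbe}{"\n"}{.readinessProbe}{"\n\n"}{end}'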

If you use security groups for Pods and ENABLE_POD_ENI is set to true, then you must turn off TCP early demux. This action lets the kubelet connect over TCP to Pods that run on branch network interfaces.

To turn off TCP early demux, run the following kubectl patch command:

kubectl patch daemonset aws-node -n kube-system -p '{"spec": {"template": {"spec": {"initContainers": [{"env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"true"}],"name":"aws-vpc-cni-init"}]}}}}'
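
After you apply the patch, you can confirm that the variable is set and wait for the DaemonSet to roll out. The following commands are a minimal sketch that assumes the default aws-vpc-cni-init init container:

# Confirm that DISABLE_TCP_EARLY_DEMUX is set on the aws-node init container.
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.initContainers[0].env}{"\n"}'

# Wait for the patched aws-node Pods to restart.
kubectl rollout status daemonset aws-node -n kube-system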

ImagePullBackOff error

The ImagePullBackOff error occurs when a container that's running in a Pod fails to pull the required image from a container registry.

The following issues can cause this error:

  • Network connectivity issues
  • Incorrect image name or tag
  • Missing credentials
  • Insufficient permissions

To determine what caused the issue, complete the following steps:

  1. To get the status of your Pod, run the following command:

    kubectl get pods -n YOUR_NAMESPACE

    Note: Replace YOUR_NAMESPACE with your namespace.

  2. To get failure details about your Pod, run the following command:

    kubectl describe pod YOUR_POD_NAME -n YOUR_NAMESPACE

    Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.

    Example output:

    Events:
    Type     Reason     Age                From                Message
    ----     ------     ----               ----                -------
    Normal   Scheduled  18m                default-scheduler   Successfully assigned kube-system/kube-proxy-h4np6 to XXX.XXX.eu-west-1.compute.internal
    Normal   Pulling    16m (x4 over 18m)  kubelet             Pulling image "<account-id>.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.21.5-eksbuild.2"
    Warning  Failed     16m (x4 over 18m)  kubelet             Failed to pull image "<account-id>.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.21.5-eksbuild.2": rpc error: code = Unknown desc = Error response from daemon: manifest for <account-id>.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.21.5-eksbuild.2 not found: manifest unknown: Requested image not found
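
Before you follow the linked article, you can quickly confirm whether the image tag exists and whether the node role can pull from the registry. The following commands are a minimal sketch; the repository name, tag, Region, and role name are placeholders:

# Check that the tag exists in your Amazon ECR repository.
aws ecr describe-images --repository-name my-app --image-ids imageTag=v1.0.0 --region eu-west-1

# Confirm that the node role has a policy that allows pulling from Amazon ECR,
# such as the AmazonEC2ContainerRegistryReadOnly managed policy.
aws iam list-attached-role-policies --role-name YOUR_NODE_ROLE_NAME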

To troubleshoot the ImagePullBackOff error, see How can I troubleshoot the Pod status ErrImagePull and ImagePullBackOff errors in Amazon EKS?
