How do I troubleshoot Kubernetes Pod issues in Amazon EKS?
The Kubernetes Pods in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster fail. I want to identify the root cause of the Pod failure.
Resolution
Identify the error that causes your Pod issue
- To get information about your Pods, run the following kubectl describe command:
kubectl describe pod YOUR_POD_NAME -n YOUR_NAMESPACE
Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.
- Identify the error message in the Events section of the command's output.
Example output:
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  24s                default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  19s (x2 over 22s)  default-scheduler  no nodes available to schedule pods
Based on the error message that you receive, use the following troubleshooting sections to resolve your Pod issue.
EBS volume mounting issues
The following example output is from a kubectl describe pod ebs-pod command. The output shows a volume node affinity error for Amazon Elastic Block Store (Amazon EBS) volume mounting:
Name:          ebs-pod
...
Status:        Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  88s (x20 over 96m)  default-scheduler  0/2 nodes are available: 2 node(s) had volume node affinity conflict
The preceding error occurs when the Persistent Volume Claim (PVC) for your Pod is bound to an Amazon EBS volume in a different Availability Zone from the node where the Pod can be scheduled. Because an EBS volume can attach only to instances in its own Availability Zone, the scheduler can't place the Pod. To resolve this issue, schedule the Pod and its PVC in the same Availability Zone as the EBS volume.
To troubleshoot the preceding error, complete the following steps:
- To get information about all the PVCs in your namespace, run the following kubectl get pvc command:
kubectl get pvc -n YOUR_NAMESPACE
Note: Replace YOUR_NAMESPACE with your namespace.
- To get information about your Persistent Volume (PV), run the following kubectl get pv command:
kubectl get pv
- To find the PV that corresponds to your PVC, run the following kubectl describe pv command:
kubectl describe pv your_PV
Note: Replace your_PV with your PV name.
- Confirm that the volume ID that you receive from the preceding command is associated with the correct Availability Zone.
- Check which Availability Zone the node that runs your Pod is located in. For example commands, see the sketch after these steps.
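To compare the Availability Zone of the EBS volume with the zone labels on your nodes, you can run commands similar to the following. These commands are a sketch; the volume ID is a placeholder that you replace with the volume ID from your PV:
# Check the Availability Zone of the EBS volume (placeholder volume ID).
aws ec2 describe-volumes --volume-ids vol-1234567890abcdef0 --query "Volumes[].AvailabilityZone"
# Check the zone label on each node.
kubectl get nodes -L topology.kubernetes.io/zone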
If you get a volume node affinity conflict, then take one of the following actions:
- Use taints and tolerations to make sure that Pods that use Amazon EBS mounting are scheduled on a node in the Availability Zone where the EBS volume is located. For more information, see Taints and tolerations on the Kubernetes website.
- Use a StatefulSet instead of a Deployment to create a unique EBS volume in the same Availability Zone for each Pod in the StatefulSet. For more information, see StatefulSets on the Kubernetes website.
Your Pod or StatefulSet might be stuck in the Pending state even when it's in the same Availability Zone as the EBS volume. To resolve this issue, run the following kubectl logs command to check the logs of the Amazon EBS CSI driver Pods:
kubectl logs your-ebs-csi-controller -n your-kube-system -c your-csi-provisioner
Note: Replace your-ebs-csi-controller with the name of your Amazon EBS CSI controller Pod, your-kube-system with the namespace where the driver runs, and your-csi-provisioner with the name of the container that you pull logs from.
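If you use the default Amazon EBS CSI driver installation, then the controller runs in the kube-system namespace. The following commands are a sketch that assumes the default label, Deployment name, and container name; adjust them if your installation differs:
# Find the EBS CSI controller Pods.
kubectl get pods -n kube-system -l app=ebs-csi-controller
# View the provisioner container logs from the controller Deployment.
kubectl logs deployment/ebs-csi-controller -n kube-system -c csi-provisioner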
ContainerCreating state error
The following error message occurs when your Pod is stuck in the ContainerCreating state and the networkPlugin cni doesn't assign an IP address to your Pod:
"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "0fdf25254b1888afeda8bf89bc1dcb093d0661ae2c8c65a4736e473c73714c65" network for pod "test": networkPlugin cni failed to set up pod "test" network: add cmd: failed to assign an IP address to container."
To troubleshoot the ContainerCreating state error, take the following actions:
- Check whether your subnet has available IP addresses. To check, open the Amazon Virtual Private Cloud (Amazon VPC) console. In the navigation pane, under Virtual private cloud, choose Subnets.
- Verify that the aws-node Pod is in the Running state. Also, make sure that you use the latest supported version of the Amazon VPC CNI.
- Check whether the number of Pods on the instance reached the maximum number of Pods that the instance supports.
- On the node where you scheduled your Pod, look for error messages in the ipamd and CNI plugin logs under the /var/log/aws-routed-eni path. For example commands, see the sketch after this list.
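The following commands are a minimal sketch of the preceding checks. They assume the default Amazon VPC CNI label and log path, and use a placeholder node name:
# Confirm that the aws-node Pods are in the Running state on every node.
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
# Compare the node's Pod capacity with the Pods that run on it (placeholder node name).
kubectl describe node YOUR_NODE_NAME | grep -i -A6 "Allocatable"
# On the affected node, review the ipamd and CNI plugin logs.
sudo ls /var/log/aws-routed-eni/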
CrashLoopBackOff state error
You receive the "Back-Off restarting failed container" error message.
The preceding error message occurs when a container repeatedly fails to start. The container enters the CrashLoopBackOff state, and Kubernetes repeatedly restarts it within the Pod.
The following issues can cause the container to repeatedly fail to start:
- Insufficient memory
- Resource overload
- Deployment errors
- External dependency issues such as DNS errors
- Third-party dependencies
- Container-level failures caused by port conflicts
To get the errors in the logs of the current Pod, run the following kubectl logs command:
kubectl logs YOUR_POD_NAME -n YOUR_NAMESPACE
Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.
To get errors in the logs of the previous Pod that crashed, run the following kubectl logs --previous command:
kubectl logs --previous YOUR_POD_NAME -n YOUR_NAMESPACE
Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.
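To see why the container last exited, for example OOMKilled when memory is insufficient, you can also check the container's last termination reason. The following command is a sketch that uses the same placeholders as the preceding commands:
kubectl get pod YOUR_POD_NAME -n YOUR_NAMESPACE -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'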
Probe failure errors
If your Pod's liveness or readiness probe fails, then you get a probe failure error because of a refused connection or a client timeout.
Troubleshoot a refused connection
If a probe failed because of a refused connection, then you might get one of the following error messages:
- "Liveness probe failed: Get https://$POD_IP:8080/<healthcheck_path>: dial tcp POD_IP:8080: connect: connection refused."
- "Readiness probe failed: Get https://$POD_IP:8080/<healthcheck_path>: dial tcp POD_IP:8080: connect: connection refused."
To troubleshoot a refused connection, complete the following steps:
- To manually request the health check path that's defined in the Pod manifest from the worker node, run the following command:
[ec2-user@ip-10-5-1-12 ~]$ curl -ikv podIP:8080/your_healthcheck_path
Note: Replace podIP with your Pod's IP address and your_healthcheck_path with your path name.
- Check the health check path that's defined in the Pod manifest for the Pod that failed the liveness probe or readiness probe. To check the health check path, run the following command:
local@bastion-host ~ % kubectl exec YOUR_POD_NAME -- curl -ikv "http://localhost:8080/your_healthcheck_path"
Note: Replace YOUR_POD_NAME with your Pod name.
- Run the same container image on the bastion host. For example commands, see the sketch after these steps.
- Check whether you can reach the health check path that's defined for the probes in the manifest. Then, check the container logs for failures, timeouts, or errors.
- To check for errors in the kubelet logs of the worker node where your Pod runs, run the following journalctl command:
[ec2-user@ip-10-5-1-12 ~]$ journalctl -u kubelet
Note: Optionally, pipe the output through grep with your Pod name to filter the results.
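The following commands are a sketch of the bastion host test and a filtered kubelet log search. They assume that Docker is available on the bastion host; the image name, tag, and health check path are placeholders:
# Run the same image on the bastion host and request the health check path locally.
docker run --rm -d -p 8080:8080 --name probe-test YOUR_IMAGE:YOUR_TAG
curl -ikv http://localhost:8080/your_healthcheck_path
docker logs probe-test
# On the worker node, filter the kubelet logs for your Pod name.
journalctl -u kubelet --no-pager | grep YOUR_POD_NAME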
Troubleshoot a client timeout
If a probe failed because of a client timeout, then you might get one of the following error messages:
- "Liveness probe failed: Get "http://podIP:8080/<healthcheck_path> ": context deadline exceeded (Client.Timeout exceeded while awaiting headers)."
- "Readiness probe failed: Get "http://podIP:8080/<healthcheck_path> ": context deadline exceeded (Client.Timeout exceeded while awaiting headers)."
To troubleshoot the client timeout, check whether the liveness probe and readiness probe configurations are correct for your application Pods.
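To review the configured probe paths, timeouts, and thresholds, you can print the probe sections of the Pod spec. The following command is a sketch that uses placeholder names:
kubectl get pod YOUR_POD_NAME -n YOUR_NAMESPACE -o yaml | grep -A8 -E "livenessProbe|readinessProbe"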
If you use a security group for Pods and ENABLE_POD_ENI=true, then you must turn off TCP early demux. This action lets the kubelet connect to the pods on the branch network interfaces that use TCP.
To turn off TCP early demux, run the following kubectl patch command:
kubectl patch daemonset aws-node -n kube-system \
  -p '{"spec": {"template": {"spec": {"initContainers": [{"env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"true"}],"name":"aws-vpc-cni-init"}]}}}}'
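After you apply the patch, you can confirm that the aws-node DaemonSet rolled out and that the environment variable is set. The following commands are a sketch that assumes the default DaemonSet name and namespace:
kubectl rollout status daemonset aws-node -n kube-system
kubectl get daemonset aws-node -n kube-system -o yaml | grep -B1 -A1 DISABLE_TCP_EARLY_DEMUX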
ImagePullBackOff error
The ImagePullBackOff error occurs when a container that's running in a Pod fails to pull the required image from a container registry.
The following issues can cause this error:
- Network connectivity issues
- Incorrect image name or tag
- Missing credentials
- Insufficient permissions
To determine what caused the issue, complete the following steps:
- To get the status of your Pod, run the following command:
kubectl get pods -n YOUR_NAMESPACE
Note: Replace YOUR_NAMESPACE with your namespace.
- To get failure details about your Pod, run the following command:
kubectl describe pod YOUR_POD_NAME -n YOUR_NAMESPACE
Note: Replace YOUR_POD_NAME with your Pod name and YOUR_NAMESPACE with your namespace.
Example output:
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  18m                default-scheduler  Successfully assigned kube-system/kube-proxy-h4np6 to XXX.XXX.eu-west-1.compute.internal
  Normal   Pulling    16m (x4 over 18m)  kubelet            Pulling image "<account-id>.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.21.5-eksbuild.2"
  Warning  Failed     16m (x4 over 18m)  kubelet            Failed to pull image "<account-id>.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.21.5-eksbuild.2": rpc error: code = Unknown desc = Error response from daemon: manifest for <account-id>.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.21.5-eksbuild.2 not found: manifest unknown: Requested image not found
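To check for an incorrect image name or tag, you can confirm the image reference that the Pod uses and check whether that tag exists in your own Amazon ECR repository. The following commands are a sketch; the repository name, tag, and AWS Region are placeholders:
# Print the image references from the Pod spec.
kubectl get pod YOUR_POD_NAME -n YOUR_NAMESPACE -o jsonpath='{.spec.containers[*].image}'
# Check whether the tag exists in your Amazon ECR repository.
aws ecr describe-images --repository-name YOUR_REPOSITORY --image-ids imageTag=YOUR_TAG --region YOUR_REGION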
To troubleshoot the ImagePullBackOff error, see How can I troubleshoot the Pod status ErrImagePull and ImagePullBackoff errors in Amazon EKS?