Zombie pods in EKS


Hi all. We have recently started seeing more zombie pods when terminating pods. Does anyone know the root cause of a pod becoming a zombie or getting stuck in the Terminating state? This is the error we keep getting: error killing pod: failed to "KillContainer" for "zombie-pod" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: 803b8598080nbdkau8i0n2526be67302a3748dbcbe3066ad0fae55707d1: container 803b8598080 PID 14597 is zombie and can not be killed. Use the --init option when creating containers to run an init inside the container that forwards signals and reaps processes"

1 Answer
Accepted Answer

Zombie pods are usually caused by containers that hold zombie processes that cannot be stopped. If you have recently seen more of these than usual, I would look at what has changed in the applications and processes running in the containers. The --init option is a Docker setting that runs an init process (docker-init, which is based on tini) as PID 1 inside the container, with your application running as its child. The init forwards signals to the application and reaps exited processes. It is usually used when signals such as SIGTERM are not handled properly by the application (SIGKILL cannot be caught or handled at all). Another option is dumb-init from Yelp.
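To make the role of an init process more concrete, here is a minimal sketch in Python of the two jobs described above: forwarding signals to the application and reaping exited children. It is not how tini or dumb-init are implemented; the function name `run_as_init` is made up for illustration, and it assumes a POSIX system with Python 3.9 or later.

```python
import os
import signal
import sys


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, and reap exited children.

    Returns the child's exit code (a negative signal number if a signal killed it).
    """
    child_pid = os.fork()
    if child_pid == 0:
        # Child: replace this process with the application.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        # Pass the signal on to the application instead of acting on it here.
        try:
            os.kill(child_pid, signum)
        except ProcessLookupError:
            pass  # the child has already exited

    previous = {s: signal.signal(s, forward) for s in (signal.SIGTERM, signal.SIGINT)}
    try:
        while True:
            # Blocks until any child exits. When running as PID 1 this also
            # collects orphans that were re-parented to this process, so they
            # do not linger as zombies.
            pid, status = os.wait()
            if pid == child_pid:
                return os.waitstatus_to_exitcode(status)
    finally:
        # Put back whatever handlers were installed before.
        for s, handler in previous.items():
            if handler is not None:
                signal.signal(s, handler)


if __name__ == "__main__" and len(sys.argv) > 1:
    code = run_as_init(sys.argv[1:])
    # Report death-by-signal the way shells do: 128 + signal number.
    sys.exit(code if code >= 0 else 128 - code)
```

In practice you would use tini or dumb-init rather than write your own; the sketch only shows why having such a process as PID 1 prevents unreaped children from piling up.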

It is always a good idea to make sure that processes, especially PID 1, handle signals properly. A zombie is a process that has exited but whose parent has not yet collected its exit status. Most of the time this is a non-issue, but several things can leave processes in a zombie state, such as duplicated calls, improper error handling, and nested calls, especially with bash.

In troubleshooting this issue, the first thing I would check is that your application handles signals properly, and then decide whether you need to update the signal handling or use a separate init process. Is your application or process creating orphaned processes (processes whose parent has exited)?
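As an illustration of the alternative to a separate init, the sketch below shows a parent stopping its own children and waiting on each one before it exits, so that none is orphaned or left as a zombie. The helper name `stop_children` and the five-second grace period are assumptions made for this example, not something taken from your application.

```python
import subprocess
import sys
import time


def stop_children(children, grace_seconds=5.0):
    """Send SIGTERM to each child, wait for it, and escalate to SIGKILL on timeout.

    `children` is a list of subprocess.Popen objects. Returns their exit codes.
    """
    for child in children:
        if child.poll() is None:  # still running
            child.terminate()     # sends SIGTERM

    deadline = time.monotonic() + grace_seconds
    codes = []
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            # wait() collects the exit status, which is what prevents a zombie.
            codes.append(child.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            child.kill()          # sends SIGKILL, which cannot be ignored
            codes.append(child.wait())
    return codes


if __name__ == "__main__":
    # Stand-in workers for demonstration: two processes that just sleep.
    workers = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(2)
    ]
    print(stop_children(workers))
```

A function like this would typically be called from the parent's SIGTERM handler or shutdown path, so the children are stopped and reaped before the parent itself goes away.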

AWS
answered a year ago
EXPERT
reviewed 10 months ago
  • Thanks for your response. Yes, our application was becoming an orphaned process, and it was going into this state because we were terminating the parent application first.
