Zombie pods in EKS

Hi all. We have recently seen an increase in zombie pods when terminating pods. Does anyone know the root cause of a pod becoming a zombie or getting stuck in the Terminating state? This is the error we keep getting: error killing pod: failed to "KillContainer" for "zombie-pod" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: 803b8598080nbdkau8i0n2526be67302a3748dbcbe3066ad0fae55707d1: container 803b8598080 PID 14597 is zombie and can not be killed. Use the --init option when creating containers to run an init inside the container that forwards signals and reaps processes"

1 Answer
Accepted Answer

Zombie pods are usually caused by containers that hold zombie processes that cannot be stopped. If you have recently seen more of these than usual, I would look at what has changed in the applications and processes running in the containers. The --init option is a Docker setting that runs a small init process (tini) as PID 1 inside the container and starts your application as its child. The init process forwards signals to your application and reaps exited processes. This is usually done when the application does not handle signals such as SIGTERM properly or does not reap its own children (SIGKILL cannot be caught by a process at all). Another option is dumb-init from Yelp.
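
To make the role of an init process more concrete, here is a minimal Python sketch of the two jobs described above: forwarding signals and reaping children. It is not how tini or dumb-init are implemented; the function name run_as_init is made up for illustration, and the sketch assumes a POSIX system.

```python
import os
import signal


def run_as_init(argv):
    """Start argv as a child, forward SIGTERM/SIGINT to it, reap every
    child that exits, and return the main child's exit code."""
    main_pid = os.fork()
    if main_pid == 0:
        # Child: replace this process image with the real application.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # exec failed (for example, command not found)

    def forward(signum, _frame):
        # Pass the signal on to the application instead of acting on it here.
        try:
            os.kill(main_pid, signum)
        except ProcessLookupError:
            pass  # the application has already exited

    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)

    while True:
        # -1 waits for ANY child. When this process is PID 1, that includes
        # orphans that were re-parented to it, so they never stay zombies.
        pid, status = os.waitpid(-1, 0)
        if pid != main_pid:
            continue  # reaped some other child; keep waiting for the main one
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        # Killed by a signal: use the shell convention of 128 + signal number.
        return 128 + os.WTERMSIG(status)
```

Calling run_as_init(["sleep", "30"]) would start sleep as the child and return its exit code once it ends. In a real container you would normally use --init, tini, or dumb-init rather than a hand-written wrapper like this.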

It is always a good idea to make sure that processes, especially PID 1, handle signals properly. Most of the time this is a non-issue, but several things can leave zombie processes behind, such as duplicated calls, improper error handling, and nested calls, especially in bash scripts.
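
For anyone unsure what a zombie process actually is, the following sketch creates one on purpose and then cleans it up. It assumes Linux, because it reads the process state from /proc. The helper names (proc_state, make_zombie, wait_until_zombie) are made up for this example.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, e.g. 'S', 'R' or 'Z'."""
    with open(f"/proc/{pid}/stat") as f:
        # The state field comes right after the command name in parentheses.
        return f.read().rsplit(")", 1)[1].split()[0]


def make_zombie():
    """Fork a child that exits at once; return its pid WITHOUT reaping it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    return pid       # parent deliberately skips waitpid()


def wait_until_zombie(pid, timeout=5.0):
    """Poll until the process shows state 'Z' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) == "Z":
            return True
        time.sleep(0.01)
    return False
```

After pid = make_zombie(), the child stays in state Z until its parent calls os.waitpid(pid, 0). Sending signals to it has no effect because it has already exited, and only a wait call from its parent (or from PID 1, after the parent exits) removes it from the process table.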

When troubleshooting this issue, the first thing I would check is that your application handles signals properly, and then decide whether you need to update the signal handling or use a separate init process. Is your application or process creating orphaned processes (processes that have lost their connection to the parent process)?
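
As one possible shape for that signal handling, here is a sketch of a parent process that stops its children before it exits, so they are not left orphaned. The Supervisor class and its method names are hypothetical, and the 10 second grace period is an arbitrary assumption.

```python
import signal
import subprocess


class Supervisor:
    """Hypothetical parent process that shuts its children down before itself."""

    def __init__(self):
        self.children = []
        self.stopping = False

    def spawn(self, argv):
        proc = subprocess.Popen(argv)
        self.children.append(proc)
        return proc

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag; the main loop then calls shutdown().
        self.stopping = True

    def shutdown(self, grace_seconds=10.0):
        """Terminate children first, wait for each, and return their exit codes."""
        for proc in self.children:
            if proc.poll() is None:
                proc.terminate()  # sends SIGTERM
        codes = []
        for proc in self.children:
            try:
                # wait() also reaps the child, so no zombie is left behind.
                codes.append(proc.wait(timeout=grace_seconds))
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate to SIGKILL after the grace period
                codes.append(proc.wait())
        return codes
```

A main loop would register the handler with signal.signal(signal.SIGTERM, supervisor.request_stop), check supervisor.stopping on each iteration, and call supervisor.shutdown() before exiting. The important part is the order: children are signalled and waited for while the parent is still alive.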

AWS, answered 1 year ago
Expert, reviewed 10 months ago
  • Thanks for your response. Yes, our application is an orphaned process, and it was going into this state because we were terminating the parent application first.
