EKS pods for completed cronjobs are no longer cleaned up

For the past few days, the pods for our cronjobs have not been cleaned up by the cronjob-controller. It started happening on multiple clusters, and we now have thousands of pods in Completed state. The pods do not have any finalizers attached. The jobs all have failedJobsHistoryLimit and successfulJobsHistoryLimit set. Recreating the jobs did not help, and this used to work fine. Any tips on how to debug this issue?

asked a month ago · 46 views
1 Answer

Thousands of lingering Completed pods put real pressure on etcd and the API server, so this is worth fixing quickly. Here are some steps to help debug and potentially resolve the issue (example commands for the individual steps follow the list):

  1. Check the TTL-after-finished controller: Add a ttlSecondsAfterFinished field to the job template in your CronJob specs if it isn't set already. The TTL-after-finished controller (stable since Kubernetes 1.23) then deletes each finished Job, and its pods along with it, after the given number of seconds (see the manifest sketch after this list).

  2. Verify the Kubernetes version: Make sure the cluster is on a supported EKS Kubernetes version; several job-cleanup bugs have been fixed in newer releases (version checks below).

  3. Check cluster resources: An overloaded cluster can fall behind on housekeeping tasks like pod cleanup. Confirm that the nodes have headroom and measure how many pods have piled up (commands below).

  4. Examine the cronjob controller logs: The cronjob controller is what enforces successfulJobsHistoryLimit and failedJobsHistoryLimit, so errors here are the most direct clue. On EKS its logs live in the managed control plane (see the CloudWatch commands after the list).

  5. Review cluster-wide policies: Check whether admission webhooks or policy engines (OPA Gatekeeper, Kyverno, and the like) are intercepting or rejecting DELETE requests (listing commands below).

  6. Inspect the kube-controller-manager: It hosts the job, cronjob, and garbage-collector controllers. You cannot exec into it on EKS, but its output is included in the controllerManager control plane log type covered below; check it for relevant error messages.

  7. Manual cleanup: As a temporary measure, delete the completed pods by hand (a sketch follows this list). This relieves the pressure on the API server but doesn't address the root cause.

  8. AWS Support: If the issue persists after these steps, open a case with AWS Support; because the problem appeared on multiple clusters at once, an issue on the EKS service side is plausible.
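
For step 1, here is a minimal sketch of a CronJob that sets both ttlSecondsAfterFinished and the history limits, assuming your cluster runs Kubernetes 1.23 or later (where the TTL-after-finished controller is stable and enabled by default). The name, schedule, and image are placeholders, not taken from your setup:

```
# Sketch: finished Jobs (and their pods) get deleted by the
# TTL-after-finished controller; older succeeded/failed Jobs are also
# pruned by the cronjob controller via the history limits.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-demo             # placeholder name
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 3  # keep at most 3 succeeded Jobs
  failedJobsHistoryLimit: 1      # keep at most 1 failed Job
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 600  # delete each finished Job after 10 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox:1.36
              command: ["sh", "-c", "echo hello"]
EOF
```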
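
For steps 2 and 3, a few quick health checks; my-cluster and the region are placeholders for your own values:

```
# Client and API server versions (job-cleanup fixes landed in newer releases)
kubectl version

# EKS-side view of the cluster version (assumes the AWS CLI is configured)
aws eks describe-cluster --name my-cluster --region eu-west-1 \
  --query 'cluster.version'

# Node headroom; 'kubectl top' requires metrics-server to be installed
kubectl get nodes
kubectl top nodes

# How many Completed pods have piled up
kubectl get pods -A --field-selector=status.phase==Succeeded --no-headers | wc -l
```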
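
For steps 4 and 6: on EKS you cannot read kube-controller-manager logs directly, because the control plane is managed. Enable the controllerManager log type and read it from CloudWatch (this incurs CloudWatch charges; cluster name and region are again placeholders):

```
# Enable controller-manager logging on the managed control plane
aws eks update-cluster-config --name my-cluster --region eu-west-1 \
  --logging '{"clusterLogging":[{"types":["controllerManager"],"enabled":true}]}'

# Tail the control plane log group (AWS CLI v2) and filter for cronjob activity
aws logs tail /aws/eks/my-cluster/cluster --follow --filter-pattern cronjob
```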
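
For step 5, list the admission webhooks that could intercept DELETE requests; a webhook with failurePolicy: Fail whose backend is down can block deletions cluster-wide:

```
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Rejected API calls often surface as Warning events
kubectl get events -A --field-selector=type=Warning | head -n 50
```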
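
For step 7, a stopgap cleanup loop; review the field selector before running it against production, and repeat with status.phase==Failed if you also want to remove failed pods:

```
# Delete Succeeded pods namespace by namespace
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl delete pods -n "$ns" --field-selector=status.phase==Succeeded
done
```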

Remember that pods from Completed or Failed jobs normally stick around for a while so logs and results can be inspected; cleanup is driven by the history limits on the CronJob. Note that those limits apply to Job objects, not to pods directly: the cronjob controller deletes old Jobs, and their pods disappear via cascading deletion through ownerReferences. If the old Jobs are already gone but the pods remain, check whether the leftover pods still carry an ownerReference and whether the garbage collector is healthy.
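
A quick way to test that theory on one of the leftover pods; the pod name and namespace are placeholders:

```
# Does the pod still point at a Job owner? An empty result means it was
# orphaned, so the garbage collector will never cascade-delete it.
kubectl get pod my-leftover-pod -n my-namespace \
  -o jsonpath='{.metadata.ownerReferences}'
```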

If the issue started suddenly across multiple clusters, a recent change on the EKS side or a Kubernetes version update is a likely trigger. Keep an eye on the EKS release notes and known issues for anything relevant.

answered a month ago
  • Nice AI, but EKS apparently doesn't support the TTL controller.
