This issue with completed CronJob pods not being cleaned up in EKS can be frustrating and can impact cluster performance. Here are some steps to help debug and potentially resolve the issue:
- Check the TTL controller: Ensure that the TTL-after-finished controller is functioning correctly. It is responsible for automatically deleting completed or failed Jobs (and their pods) after a specified time. You can add a `ttlSecondsAfterFinished` field to your Job specifications to enable this feature if it's not already in use (see the first sketch after this list).
- Verify the Kubernetes version: Ensure you're running a supported Kubernetes version on your EKS cluster; some issues with Job cleanup have been resolved in newer versions (the version and capacity checks are sketched after this list).
- Check cluster resources: Verify that your cluster has sufficient resources. If the cluster is overloaded, it might struggle to perform cleanup tasks effectively.
- Examine the CronJob controller logs: Look for any error messages or unusual behavior in the logs of the CronJob controller; this might provide insight into why the cleanup is not occurring. On EKS this controller runs inside the managed control plane, so you'll need control plane logging enabled (see the CloudWatch sketch after this list).
- Review cluster-wide policies: Check if there are any cluster-wide policies or admission controllers that might be interfering with the deletion process (a quick webhook listing is sketched after this list).
- Inspect the kube-controller-manager: The kube-controller-manager runs the built-in controllers, including the Job and TTL controllers. Check its logs for any relevant error messages; on EKS these are only exposed through CloudWatch control plane logging (same sketch as above).
- Manual cleanup: As a temporary measure, you can manually delete the completed pods (a one-liner is sketched after this list). However, this doesn't address the root cause.
- AWS Support: If the issue persists after trying these steps, it may be worth contacting AWS Support, as there could be an underlying issue with the EKS service itself.
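As a reference for the TTL step, here is a minimal sketch of a Job with `ttlSecondsAfterFinished` set. The name `ttl-demo`, the `busybox` image, and the 300-second TTL are illustrative, not from the original question:

```
# Apply a throwaway Job whose objects are deleted ~5 minutes after it finishes.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo                  # hypothetical name, for illustration only
spec:
  ttlSecondsAfterFinished: 300    # TTL controller deletes the Job and its pods 300s after completion
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36
        command: ["sh", "-c", "echo done"]
EOF
```

For a CronJob, the same field goes under `spec.jobTemplate.spec.ttlSecondsAfterFinished`, so each Job the CronJob spawns is cleaned up independently of the history limits.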
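For the version and capacity checks, assuming you have kubectl access and metrics-server installed (required for `kubectl top`):

```
# Show the client and server (control plane) Kubernetes versions
kubectl version

# Confirm nodes are Ready and not under resource pressure
kubectl get nodes
kubectl top nodes    # requires metrics-server
```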
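EKS runs the control plane for you, so controller logs are only visible once control plane logging is turned on. A sketch using the AWS CLI; the cluster name `my-cluster` and the `ttl` filter pattern are placeholders to adapt:

```
# One-time: enable kube-controller-manager logs for the cluster
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["controllerManager"],"enabled":true}]}'

# Later: search the control plane log group for TTL / job-controller messages
aws logs filter-log-events \
  --log-group-name /aws/eks/my-cluster/cluster \
  --log-stream-name-prefix kube-controller-manager \
  --filter-pattern "ttl" \
  --max-items 50
```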
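To check for admission webhooks that could interfere with deletions, and for finalizers stuck on a lingering pod (a common reason pods refuse to go away; the pod name is a placeholder):

```
# Policy engines and other webhooks that intercept API requests
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# A non-empty finalizer list on a completed pod will block its deletion
kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}'
```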
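And for the temporary manual cleanup, a sketch scoped to the `default` namespace (adjust as needed), previewing with a client-side dry run first:

```
# Preview which completed pods would be removed
kubectl delete pods --field-selector=status.phase==Succeeded -n default --dry-run=client

# Delete them; repeat with status.phase==Failed for failed pods
kubectl delete pods --field-selector=status.phase==Succeeded -n default
```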
Remember, when Jobs are marked Completed or Failed, their pods can continue to exist; this is normal behavior that allows you to view logs and results. In a properly functioning cluster, however, they should be cleaned up based on the history limits you've set, as in the snippet below.
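Those history limits live on the CronJob itself; a minimal sketch in which the name, schedule, and limit values are illustrative:

```
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: history-demo              # hypothetical name
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 3   # keep at most 3 finished Jobs (and their pods)
  failedJobsHistoryLimit: 1       # keep at most 1 failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: busybox:1.36
            command: ["sh", "-c", "echo done"]
EOF
```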
If the issue started happening suddenly across multiple clusters, a recent change in the EKS environment or a Kubernetes update might be responsible. Keep an eye on the EKS release notes and known issues for any relevant information.
Sources
- Simplify compute management with AWS Fargate - Amazon EKS
- Workloads - Amazon EKS
Nice AI, but EKS apparently doesn't support the TTL controller.