Triggering an alert for jobs that have been stuck in runnable for too long


The Troubleshooting AWS Batch documentation mentions the various reasons a job can be stuck in RUNNABLE. What I'm unsure about is how to trigger an alert when something has been stuck in RUNNABLE for too long.

I have set up EventBridge rules to send SNS notifications when:

  • there is a Batch Job State Change with status FAILED (e.g. this tutorial), which works as expected
  • another rule for Batch Job Queue Blocked, which I can't seem to trigger (a sketch of both rules follows below)
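
For reference, this is roughly how the two rules are wired to the SNS topic. It is only a sketch: the rule names, the SNS topic ARN, and the exact event-pattern fields (particularly for Batch Job Queue Blocked) are assumptions rather than copies of my actual setup.

```python
import json
import boto3

events = boto3.client("events")

SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:batch-alerts"  # placeholder

# Event patterns for the two rules described above.
rules = {
    "batch-job-failed": {
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {"status": ["FAILED"]},
    },
    "batch-job-queue-blocked": {
        "source": ["aws.batch"],
        "detail-type": ["Batch Job Queue Blocked"],
    },
}

for name, pattern in rules.items():
    events.put_rule(Name=name, EventPattern=json.dumps(pattern), State="ENABLED")
    events.put_targets(Rule=name, Targets=[{"Id": "sns-target", "Arn": SNS_TOPIC_ARN}])
```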

The jobs remain in RUNNABLE for hours even though I've also configured job queue state time limit actions to cancel jobs after 10 minutes whenever the CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY, MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE, or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT status reasons appear.
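
For completeness, the state time limit actions were set up roughly like the boto3 sketch below; the queue name is a placeholder, and the field layout follows my understanding of the jobStateTimeLimitActions parameter of update-job-queue.

```python
import boto3

batch = boto3.client("batch")

REASONS = [
    "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY",
    "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
    "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
]

batch.update_job_queue(
    jobQueue="my-job-queue",  # placeholder queue name
    jobStateTimeLimitActions=[
        {
            "reason": reason,
            "state": "RUNNABLE",    # RUNNABLE is the only state these actions apply to
            "maxTimeSeconds": 600,  # 10 minutes
            "action": "CANCEL",
        }
        for reason in REASONS
    ],
)
```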

I was hoping the job state time limit actions would fire after 10 minutes and cancel the job, which would then cause the Batch Job State Change rule for status FAILED to fire and alert my SNS topic.

Questions

  1. What is the best way to trigger an alert for jobs that have been stuck in runnable for too long?
  2. When exactly do the CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY, MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE, or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT status reasons appear?

Things I've checked and other observations:

  • My jobs do run. This question is about setting up alerts for jobs that have been in RUNNABLE for too long because they're stuck behind another job or have a misconfigured job definition. I'm testing this in a few ways:
    • Setting the compute environment to 2 vCPU, submitting a 1-hour job with 2 vCPU (consuming all of the compute environment's capacity), and immediately submitting a second job. This second job is stuck behind the first and should be cancelled by the job state time limit actions after 10 minutes, but isn't (instead it waits for the first job to finish and then completes)
    • Submitting a job with 4 vCPU, which can't run in the compute environment since it only has 2 vCPU
  • Inspecting the jobs with aws batch describe-jobs --jobs <jobId> shows the jobs in the RUNNABLE state with no statusReason (see the sketch after this list)
  • My compute environment doesn't specify a service role, so the managed service-linked role is used, which has the permissions needed to manage the job state time limit actions
  • This isn't the job timeout (attemptDurationSeconds) parameter, as that relates to jobs already in the RUNNING state; my jobs are stuck in RUNNABLE
  • I'm using Fargate Spot, if that makes any difference
  • Via a support case: transition of jobs because of "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY" or "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE" only applies to EC2 jobs, not Fargate. This would require action on the AWS side to resolve. In the meantime, Didier's answer remains the most viable way to detect this situation.
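
For reference, the describe-jobs check mentioned above looks roughly like this in boto3 (the job id is the same placeholder as in the CLI call):

```python
import time
import boto3

batch = boto3.client("batch")

resp = batch.describe_jobs(jobs=["<jobId>"])  # placeholder job id
now_ms = time.time() * 1000
for job in resp["jobs"]:
    age_min = (now_ms - job["createdAt"]) / 60000  # createdAt is in epoch milliseconds
    print(job["jobId"], job["status"], job.get("statusReason"), f"{age_min:.1f} min since creation")
```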

Andrew
Asked 2 months ago · 73 views
1 Answer
Accepted Answer

Hi,

To solve this issue, we created a watchdog Lambda scheduled via cron every 5 minutes. For this scheduling, see https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html

This Lambda will list all jobs stuck in the RUNNABLE state and cancel those that have been waiting for too long.

To list jobs, get their details, and cancel them, use the equivalent of the aws batch list-jobs, describe-jobs, and cancel-job CLI commands in your favorite language via the corresponding SDK.
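
As a rough illustration only, such a watchdog could look like the sketch below; the queue name, SNS topic ARN, threshold, and the choice to cancel rather than just alert are all assumptions you would adapt to your setup.

```python
import time
import boto3

batch = boto3.client("batch")
sns = boto3.client("sns")

JOB_QUEUE = "my-job-queue"                                          # placeholder
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:batch-alerts"   # placeholder
MAX_RUNNABLE_MS = 30 * 60 * 1000                                    # assumed 30-minute threshold


def handler(event, context):
    """Cancel and report jobs that have sat in RUNNABLE for longer than the threshold."""
    now_ms = time.time() * 1000
    paginator = batch.get_paginator("list_jobs")
    for page in paginator.paginate(jobQueue=JOB_QUEUE, jobStatus="RUNNABLE"):
        for job in page["jobSummaryList"]:
            created = job.get("createdAt")  # epoch milliseconds; may be absent for some array jobs
            if created is None or now_ms - created <= MAX_RUNNABLE_MS:
                continue
            batch.cancel_job(jobId=job["jobId"], reason="Stuck in RUNNABLE for too long")
            sns.publish(
                TopicArn=SNS_TOPIC_ARN,
                Subject="Batch job stuck in RUNNABLE",
                Message=f"Cancelled job {job['jobId']} ({job.get('jobName')}) "
                        f"after more than {MAX_RUNNABLE_MS // 60000} minutes in RUNNABLE.",
            )
```

Scheduling this handler on the EventBridge cron rule mentioned above gives you both the alert and the cleanup in one place.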

Best,

Didier

AWS Expert · Answered 2 months ago
