The Troubleshooting AWS Batch page mentions the various reasons a job can be stuck in `RUNNABLE`. What I'm unsure about is how to trigger an alert when a job has been stuck in `RUNNABLE` for too long.
I have set up two EventBridge rules to send SNS notifications (both sketched below):
- one for a `Batch Job State Change` with status `FAILED` (e.g. this tutorial), which works as expected
- another for `Batch Job Queue Blocked`, which I can't seem to trigger
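Roughly, the two rules look like this via boto3 (the rule names and topic ARN are placeholders, and I'm assuming the SNS topic policy already allows EventBridge to publish to it):

```python
import json

import boto3

events = boto3.client("events")

SNS_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:batch-alerts"  # placeholder

# Rule 1: any Batch job that transitions to FAILED (this one works as expected).
events.put_rule(
    Name="batch-job-failed",
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {"status": ["FAILED"]},
    }),
    State="ENABLED",
)
events.put_targets(Rule="batch-job-failed", Targets=[{"Id": "sns", "Arn": SNS_TOPIC_ARN}])

# Rule 2: blocked job queues (this is the one I can't seem to trigger).
events.put_rule(
    Name="batch-job-queue-blocked",
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job Queue Blocked"],
    }),
    State="ENABLED",
)
events.put_targets(Rule="batch-job-queue-blocked", Targets=[{"Id": "sns", "Arn": SNS_TOPIC_ARN}])
```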
The jobs remain in the `RUNNABLE` state for hours, even though I've also configured job queue state time limit actions to cancel jobs after 10 minutes whenever the `CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY`, `MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE`, or `MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT` status reasons appear.
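For reference, this is roughly how the state time limit actions are configured on the queue (the queue name is a placeholder; `maxTimeSeconds` of 600 is the 10 minutes, and the shape of `jobStateTimeLimitActions` is as I understand the `update_job_queue` API):

```python
import boto3

batch = boto3.client("batch")

# Cancel jobs that have sat in RUNNABLE for 10 minutes with one of these status reasons.
batch.update_job_queue(
    jobQueue="my-job-queue",  # placeholder
    jobStateTimeLimitActions=[
        {
            "reason": reason,
            "state": "RUNNABLE",
            "maxTimeSeconds": 600,
            "action": "CANCEL",
        }
        for reason in (
            "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY",
            "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
            "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
        )
    ],
)
```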
I was hoping the state time limit actions would trigger after 10 minutes and cancel the job, which would then cause the `Batch Job State Change` rule for status `FAILED` to fire and notify my SNS topic.
Questions
- What is the best way to trigger an alert for jobs that have been stuck in `RUNNABLE` for too long?
- When exactly do the `CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY`, `MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE`, or `MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT` status reasons appear?
Things I've checked and other observations:
- My jobs do run. This question is about setting up alerts for jobs that have been in `RUNNABLE` for too long because they're stuck behind another job or have a misconfigured job definition. I'm testing this in a few ways:
- Setting the compute environment to 2 vCPU, submitting a 1-hour job with 2 vCPU (consuming all of the compute environment's capacity), and immediately submitting a second job. This second job is stuck behind the first and should be cancelled by the state time limit actions after 10 minutes, but isn't (instead it waits for the first job to finish and then completes)
- Submitting a job with 4 vCPU, which can't run in the compute environment since it only has 2 vCPU
- Inspecting the jobs with `aws batch describe-jobs --jobs <jobId>` shows them in the `RUNNABLE` state with no `statusReason` (see the sketch after this list)
- My compute environment doesn't specify a service role, so the managed service-linked role is used, which has the permissions needed to manage the state time limit actions
- This isn't the job timeout `attemptDurationSeconds` parameter, as that only applies to jobs already in the `RUNNING` state. My jobs are stuck in `RUNNABLE`
- I'm using Fargate Spot, if that makes any difference
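For completeness, the inspection mentioned above boils down to something like this (the queue name is again a placeholder; `createdAt` is the submission time in epoch milliseconds, which at least gives a rough idea of how long a job has been waiting):

```python
import time

import boto3

batch = boto3.client("batch")

# List everything currently RUNNABLE on the queue, then describe the jobs to
# look at statusReason -- it stays empty for my stuck jobs.
runnable = batch.list_jobs(jobQueue="my-job-queue", jobStatus="RUNNABLE")
job_ids = [j["jobId"] for j in runnable["jobSummaryList"]]

if job_ids:
    for job in batch.describe_jobs(jobs=job_ids)["jobs"]:
        waiting_min = (time.time() * 1000 - job["createdAt"]) / 60_000
        print(job["jobId"], job["status"],
              job.get("statusReason", "<no statusReason>"),
              f"submitted {waiting_min:.0f} min ago")
```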
Via a support case, I learned that the transition of jobs because of `CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY` or `MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE` only applies to EC2 jobs, not Fargate. Resolving this would require action on the AWS side. In the meantime, Didier's answer remains the most viable way to detect this situation.