Triggering an alert for jobs that have been stuck in runnable for too long


The Troubleshooting AWS Batch guide lists the various reasons a job can be stuck in RUNNABLE. What I'm unsure about is how to trigger an alert when a job has been stuck in RUNNABLE for too long.

I have set up EventBridge rules to send SNS notifications when:

  • there is a Batch Job State Change of status FAILED (e.g. this tutorial) which works as expected
  • another rule for Batch Job Queue Blocked which I can't seem to trigger
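For reference, the two rules described above could use event patterns along the lines of the sketch below. The helper name `batch_event_pattern` is hypothetical; the resulting JSON string is what would be supplied as the rule's event pattern (for example as `EventPattern` in boto3's `events.put_rule`). This only illustrates the pattern shape and is not a confirmed fix for the rule that does not fire.

```python
import json


def batch_event_pattern(detail_type, detail=None):
    """Build an EventBridge event pattern (as JSON) for AWS Batch events.

    Hypothetical helper for illustration only.
    """
    pattern = {"source": ["aws.batch"], "detail-type": [detail_type]}
    if detail:
        # Optional filter on fields inside the event's "detail" object
        pattern["detail"] = detail
    return json.dumps(pattern)


# Rule 1: jobs that transition to FAILED
FAILED_JOBS = batch_event_pattern("Batch Job State Change", {"status": ["FAILED"]})

# Rule 2: job queue blocked events
QUEUE_BLOCKED = batch_event_pattern("Batch Job Queue Blocked")
```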

The jobs remain in RUNNABLE for hours, even though I've also configured job queue state time limit actions to cancel jobs after 10 minutes whenever the status reasons CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY, MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE, or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT appear.

I was hoping the job state limits would trigger after 10 minutes, cancel the job, which would then cause the Batch Job State Change of status FAILED rule to be triggered, and then alert my SNS topic.

Questions

  1. What is the best way to trigger an alert for jobs that have been stuck in runnable for too long?
  2. When exactly do the CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY, MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE, or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT status reasons appear?

Things I've checked and other observations:

  • My jobs do run. This question is about setting up alerts on jobs that have been in RUNNABLE for too long because they're stuck behind another job or misconfigured job definitions. I'm testing this in a few ways:
    • Setting the compute environment to 2 vCPU, submitting a 1-hour job with 2 vCPU (consuming all of the compute environment's capacity), and immediately submitting a second job. This second job is stuck behind the first and should be cancelled by the job state limit actions after 10 minutes, but it isn't (instead it waits for the first job to finish and then completes)
    • Submitting a job with 4 vCPU which can't run in the compute environment since it only has 2 vCPU
  • Inspecting the jobs with aws batch describe-jobs --jobs <jobId> shows them in the RUNNABLE state with no statusReason
  • My compute environment doesn't specify a service role, so the managed service-linked role is used, which has the permissions to manage the job state limit actions
  • This isn't about the job timeout attemptDurationSeconds parameter, as that applies to jobs already in the RUNNING state. My jobs are stuck in RUNNABLE
  • I'm using Fargate Spot, if that makes any difference
Andrew
asked 11 days ago, 33 views
1 Answer

Hi,

To solve this issue, we created a watchdog Lambda scheduled via cron every 5 minutes. For this scheduling, see https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html

This Lambda will list all jobs in Running state and kill those that have been running for too long.

To list jobs, get their details and cancel them, see:

Use the equivalent of this CLI command in your favorite language via the corresponding SDK.
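A minimal sketch of such a watchdog in Python with boto3, adapted to the RUNNABLE case from the question rather than the Running state mentioned above. Several things here are assumptions and not part of the original setup: the environment variable names `JOB_QUEUE` and `TOPIC_ARN`, the 600-second threshold, and the use of `createdAt` (the submission time) as an approximation of how long a job has been waiting, since `list_jobs` does not report when a job entered RUNNABLE.

```python
import time

MAX_RUNNABLE_SECONDS = 600  # assumed threshold (the 10 minutes from the question)


def find_stale_jobs(batch, job_queue, max_age_seconds, now_ms=None):
    """Return summaries of RUNNABLE jobs submitted more than max_age_seconds ago."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    stale = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "RUNNABLE"}
    while True:
        page = batch.list_jobs(**kwargs)
        for job in page.get("jobSummaryList", []):
            # createdAt is in epoch milliseconds and marks submission time,
            # so this over-estimates the time actually spent in RUNNABLE.
            if now_ms - job["createdAt"] > max_age_seconds * 1000:
                stale.append(job)
        token = page.get("nextToken")
        if not token:
            return stale
        kwargs["nextToken"] = token  # fetch the next page of results


def lambda_handler(event, context):
    # Imported here so find_stale_jobs can be tested without AWS dependencies
    import os
    import boto3

    batch = boto3.client("batch")
    sns = boto3.client("sns")
    queue = os.environ["JOB_QUEUE"]    # hypothetical variable name
    topic = os.environ["TOPIC_ARN"]    # hypothetical variable name

    stale = find_stale_jobs(batch, queue, MAX_RUNNABLE_SECONDS)
    for job in stale:
        # cancel_job applies to jobs that have not started running yet
        batch.cancel_job(jobId=job["jobId"], reason="Stuck in RUNNABLE too long")
    if stale:
        sns.publish(
            TopicArn=topic,
            Subject="AWS Batch jobs stuck in RUNNABLE",
            Message="\n".join(job["jobId"] for job in stale),
        )
    return {"cancelled": [job["jobId"] for job in stale]}
```

To alert without cancelling, the `cancel_job` call can be dropped and only the SNS message sent. The Lambda's role would need permission for `batch:ListJobs`, `batch:CancelJob`, and `sns:Publish`.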

Best,

Didier

answered 11 days ago
reviewed 4 days ago
