Triggering an alert for jobs that have been stuck in runnable for too long

The Troubleshooting AWS Batch documentation mentions the various reasons a job can be stuck in RUNNABLE. What I'm unsure about is how to trigger an alert when a job has been stuck in RUNNABLE for too long.

I have set up EventBridge rules to send SNS notifications when:

  • there is a Batch Job State Change with status FAILED (e.g. this tutorial), which works as expected (a rough sketch of this rule is below)
  • another rule for Batch Job Queue Blocked, which I can't seem to trigger
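
For reference, a minimal boto3 sketch of what such a FAILED rule can look like (the rule name and SNS topic ARN below are placeholders, not my actual values):

    import json
    import boto3

    events = boto3.client("events")

    # Match AWS Batch job state changes where the job has moved to FAILED.
    events.put_rule(
        Name="batch-job-failed",  # placeholder rule name
        EventPattern=json.dumps({
            "source": ["aws.batch"],
            "detail-type": ["Batch Job State Change"],
            "detail": {"status": ["FAILED"]},
        }),
    )
    # Send matching events to an SNS topic (placeholder ARN).
    events.put_targets(
        Rule="batch-job-failed",
        Targets=[{"Id": "sns-alerts", "Arn": "arn:aws:sns:us-east-1:123456789012:batch-alerts"}],
    )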

The jobs remain in RUNNABLE for hours even though I've also configured job queue state time limit actions to cancel jobs after 10 minutes whenever the CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY, MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE, or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT status reasons appear.
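
For completeness, a minimal boto3 sketch of that kind of time limit configuration (the queue name is a placeholder; the reasons and the 600-second limit match what I described above):

    import boto3

    batch = boto3.client("batch")

    # Cancel jobs that sit in RUNNABLE for more than 10 minutes for these reasons.
    batch.update_job_queue(
        jobQueue="my-queue",  # placeholder queue name
        jobStateTimeLimitActions=[
            {
                "reason": reason,
                "state": "RUNNABLE",
                "maxTimeSeconds": 600,  # 10 minutes
                "action": "CANCEL",
            }
            for reason in (
                "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY",
                "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
                "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
            )
        ],
    )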

I was hoping the job state time limit actions would trigger after 10 minutes and cancel the job, which would then cause the Batch Job State Change of status FAILED rule to fire and alert my SNS topic.

Questions

  1. What is the best way to trigger an alert for jobs that have been stuck in runnable for too long?
  2. When exactly do the CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY, MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE, or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT status reasons appear?

Things I've checked and other observations:

  • My jobs do run. This question is about setting up alerts on jobs that have been in RUNNABLE for too long because they're stuck behind another job or because of a misconfigured job definition. I'm testing this in a few ways:
    • Setting the compute environment to 2 vCPU, submitting a 1-hour job with 2 vCPU (consuming all of the compute environment's capacity), and immediately submitting a second job. This second job is stuck behind the first and should be cancelled by the job state time limit actions after 10 minutes, but isn't (instead it waits for the first job to finish and then completes)
    • Submitting a job with 4 vCPU which can't run in the compute environment since it only has 2 vCPU
  • Inspecting the jobs with aws batch describe-jobs --jobs <jobId> shows them in the RUNNABLE state with no statusReason (see the sketch after this list)
  • My compute environment doesn't specify a service role, so the managed service-linked role is used, which has the permissions needed to manage the job state time limit actions
  • This isn't the job timeout (attemptDurationSeconds) parameter, as that relates to jobs already in the RUNNING state. My jobs are stuck in RUNNABLE
  • I'm using Fargate Spot if that makes any difference
  • Via a support case, I learned that transitioning jobs because of "CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY" or "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE" only applies to EC2 jobs, not Fargate. This would require action on the AWS side to resolve. In the meantime, Didier's answer remains the most viable way to detect this situation.
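
For reference, the describe-jobs check above looks roughly like this with boto3 (the job ID is a placeholder):

    import boto3

    batch = boto3.client("batch")

    # Print each job's status and statusReason (statusReason is absent for my stuck jobs).
    resp = batch.describe_jobs(jobs=["<jobId>"])  # placeholder job ID
    for job in resp["jobs"]:
        print(job["jobName"], job["status"], job.get("statusReason", "<no statusReason>"))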

Andrew
Asked 2 months ago · 73 views
1 answer
Accepted Answer

Hi,

To solve this issue, we created a watchdog Lambda scheduled via cron every 5 minutes. For this scheduling, see https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html
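
A minimal sketch of that schedule with boto3 (rule name and Lambda ARN are placeholders; the Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it):

    import boto3

    events = boto3.client("events")

    # Run the watchdog Lambda every 5 minutes.
    events.put_rule(
        Name="batch-runnable-watchdog-schedule",  # placeholder rule name
        ScheduleExpression="rate(5 minutes)",
    )
    events.put_targets(
        Rule="batch-runnable-watchdog-schedule",
        Targets=[{
            "Id": "watchdog-lambda",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:batch-watchdog",  # placeholder ARN
        }],
    )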

This Lambda will list all jobs in the RUNNABLE state and cancel those that have been stuck there for too long.

To list jobs, get their details, and cancel them, use the equivalents of the list-jobs, describe-jobs, and cancel-job CLI commands in your favorite language via the corresponding SDK.
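
A minimal sketch of such a watchdog in Python/boto3 (the queue name and the 10-minute threshold are assumptions to adapt to your setup):

    import time
    import boto3

    batch = boto3.client("batch")

    QUEUE = "my-queue"       # placeholder queue name
    MAX_AGE_SECONDS = 600    # assumed threshold: cancel after 10 minutes in RUNNABLE


    def handler(event, context):
        now_ms = int(time.time() * 1000)
        next_token = None
        while True:
            kwargs = {"jobQueue": QUEUE, "jobStatus": "RUNNABLE"}
            if next_token:
                kwargs["nextToken"] = next_token
            page = batch.list_jobs(**kwargs)
            for job in page["jobSummaryList"]:
                age_s = (now_ms - job["createdAt"]) / 1000  # createdAt is epoch milliseconds
                if age_s > MAX_AGE_SECONDS:
                    batch.cancel_job(
                        jobId=job["jobId"],
                        reason=f"Stuck in RUNNABLE for {int(age_s)} seconds",
                    )
            next_token = page.get("nextToken")
            if not next_token:
                break

Cancelling a RUNNABLE job moves it to FAILED, so an existing Batch Job State Change rule on FAILED will then publish to your SNS topic.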

Best,

Didier

AWS · Expert
Answered 2 months ago
Expert
Reviewed 1 month ago
