AWS Batch - jobs in RUNNABLE for several hours


I'm trying to understand what could cause AWS Batch jobs (using EC2 Spot instances) to sometimes appear stuck in the RUNNABLE state for several hours before finally getting picked up.

This annoying behaviour seems to come and go over time. Some days, the very same jobs, configured in the very same way and using the same queues and compute environments, are picked up and processed almost immediately; on other days they sit in RUNNABLE status for a long time (recently 3 to 6 hours).

The usual troubleshooting documents don't help, as they only seem to cover cases where the job never gets picked up at all (due to a configuration issue, or a vCPU/memory mismatch between the job and the compute environment). What I observe when I hit this issue is that no Spot request shows up in the EC2 dashboard at all.

The Spot pricing at that time, for the type of instance I need (32 vCPUs, 64 GB of memory), is not spiking (and I've set the maximum Spot price to 100% of on-demand anyway). So one theory is that there is simply no Spot capacity available at that time, but 1) that seems unlikely (I use eu-west-1) and 2) I can't find any definitive way to validate that theory (the closest I've gotten is the placement-score sketch below, which is only an indirect signal). My account limits on the M and R instance types (the ones typically used when these jobs run) are very high (1000+), so as far as I can tell that isn't the reason either.
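For reference, this is the kind of check I've been running with boto3 to get at least an indirect capacity signal. It is only a sketch: the instance types listed are my guess at what Batch picks for the 32 vCPU / 64 GB shape, so adjust them to whatever your compute environment actually allows.

```python
import boto3
from datetime import datetime, timedelta, timezone

REGION = "eu-west-1"
# Assumption: these are the instance types Batch chooses for the 32 vCPU / 64 GB jobs.
INSTANCE_TYPES = ["m5.8xlarge", "r5.8xlarge"]

ec2 = boto3.client("ec2", region_name=REGION)

# 1) Spot placement scores: a 1-10 rating of how likely a Spot request of this
#    size is to succeed right now. Consistently low scores hint at a capacity problem.
scores = ec2.get_spot_placement_scores(
    InstanceTypes=INSTANCE_TYPES,
    TargetCapacity=1,
    SingleAvailabilityZone=False,
    RegionNames=[REGION],
)
for s in scores["SpotPlacementScores"]:
    print(f"{s['Region']}: score {s['Score']}/10")

# 2) Spot price history over the last few hours, to confirm pricing isn't spiking.
history = ec2.describe_spot_price_history(
    InstanceTypes=INSTANCE_TYPES,
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
)
for h in history["SpotPriceHistory"][:10]:
    print(h["AvailabilityZone"], h["InstanceType"], h["SpotPrice"], h["Timestamp"])
```

Neither call proves that capacity was missing at the exact moment a job was stuck, but together they at least show whether the Spot pool for these types looks healthy.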

Does anyone have a theory or suggestion?

For now, my workaround is to add a compute environment with on-demand instances to the queue, but that more than doubles the cost...
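In case it's useful, this is roughly what that change looks like via boto3. The queue and compute environment names below are placeholders; the idea is to list the Spot environment first and the on-demand environment second, since Batch tries compute environments in order and should only spill over to on-demand when the Spot one can't scale.

```python
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

# Placeholders -- substitute the names/ARNs of your existing compute environments.
SPOT_CE = "my-spot-ce"
ONDEMAND_CE = "my-ondemand-ce"

# Attach both environments to the existing queue, Spot first, on-demand as fallback.
batch.update_job_queue(
    jobQueue="my-batch-queue",  # placeholder queue name
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": SPOT_CE},
        {"order": 2, "computeEnvironment": ONDEMAND_CE},
    ],
)
```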

asked 2 years ago · 188 views
No Answers
