AWS Batch - jobs in RUNNABLE for several hours


I'm trying to understand what could cause AWS Batch jobs (using EC2 Spot instances) to sometimes appear stuck in the RUNNABLE state for several hours before finally getting picked up.

This annoying behaviour comes and goes over time. Some days, the very same jobs, configured in the very same way and using the same queues and compute environments, are picked up and processed almost immediately; other days they sit in RUNNABLE status for a long time (recently I experienced 3 to 6 hours).

The usual troubleshooting documents don't help, as they only seem to cover cases where the job never gets picked up (due to a configuration issue, or a mismatch of vCPU/memory between the job and the compute environment).

What I observe when I hit this issue is that no Spot request appears in the EC2 dashboard at all. The Spot pricing at that time, for the type of instance I need (32 vCPU, 64 GB memory), is not spiking (and I've set my limit to 100% of the On-Demand price anyway). So one theory is that there is simply no Spot capacity available at that moment, but 1) that seems unlikely (I use eu-west-1) and 2) I can't find any way to validate that theory. My account limits on the M and R instance types (the ones typically used when the jobs run) are very high (1000+), so as far as I can tell that's not the reason either.
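For what it's worth, this is roughly the kind of check I'd like to be able to run to confirm or rule out the capacity theory: a minimal sketch using boto3's GetSpotPlacementScores and DescribeSpotPriceHistory APIs. The instance types m5.8xlarge and r5.2xlarge are just stand-ins for whatever my compute environment would actually select; eu-west-1 is my real region.

```python
import datetime

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Candidate instance types -- stand-ins for whatever the compute
# environment would actually pick for a 32 vCPU / 64 GB job.
instance_types = ["m5.8xlarge", "r5.2xlarge"]

# Spot placement scores (1-10) give a rough indication of how likely
# a Spot request of this size is to succeed in the region right now.
scores = ec2.get_spot_placement_scores(
    InstanceTypes=instance_types,
    TargetCapacity=1,
    TargetCapacityUnitType="units",
    RegionNames=["eu-west-1"],
)
for entry in scores["SpotPlacementScores"]:
    print(f"{entry['Region']}: score {entry['Score']}")

# Recent Spot price history, to double-check that pricing is not spiking.
start = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=6)
history = ec2.describe_spot_price_history(
    InstanceTypes=instance_types,
    ProductDescriptions=["Linux/UNIX"],
    StartTime=start,
)
for record in history["SpotPriceHistory"][:10]:
    print(record["AvailabilityZone"], record["InstanceType"], record["SpotPrice"])
```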

Does anyone have a theory or suggestion?

For now, my workaround is to add a compute environment with On-Demand instances to the queue, but that more than doubles the cost...
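In case it helps anyone reading this, here is roughly what that workaround looks like when configuring the queue with boto3. The queue name and compute environment ARNs are placeholders; the Spot environment keeps order 1 so Batch considers it first, with the On-Demand environment listed after it.

```python
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

# Placeholder ARNs -- substitute the real Spot and On-Demand
# compute environments attached to the account.
SPOT_CE = "arn:aws:batch:eu-west-1:123456789012:compute-environment/spot-ce"
ONDEMAND_CE = "arn:aws:batch:eu-west-1:123456789012:compute-environment/ondemand-ce"

# Batch considers compute environments in ascending 'order' when
# placing jobs, so Spot is attempted first and On-Demand comes second.
batch.update_job_queue(
    jobQueue="my-job-queue",
    state="ENABLED",
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": SPOT_CE},
        {"order": 2, "computeEnvironment": ONDEMAND_CE},
    ],
)
```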

Asked 2 years ago · 177 views
No answers
