Start up times for AWS Batch jobs with big outliers


We use Batch to start jobs automatically on g4dn.2xlarge EC2 instances (On-Demand, due to our business requirements). Start-up times (the time from state RUNNABLE to RUNNING) are usually around 6 minutes, but sometimes there are outliers where we have to wait more than 1.5 hours for a job to start.

What is causing this delay?

Can we do something about it?

  • The question "AWS Batch - jobs in RUNNABLE for several hours" seems to be at least related, but it focuses on Spot Instances and unfortunately has received no answers for a year :-(

  • My first guess is insufficient capacity, i.e. no EC2 instances of that type available in the region. We started multiple jobs in parallel and could not get more than 8 instances running at the same time in eu-central-1. If that really is the current maximum, I guess we have to move to another cloud provider. (A quick way to check the relevant quota is sketched below.)
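
For reference, a minimal sketch (AWS CLI) to check whether the ceiling is the On-Demand G-instance vCPU quota rather than true capacity; g4dn instances count against the "Running On-Demand G and VT instances" quota, and eight g4dn.2xlarge instances at 8 vCPUs each would exactly fill a 64-vCPU quota:

    # List the EC2 On-Demand vCPU quotas in eu-central-1 and filter for the
    # G/VT instance family; the returned Value is a vCPU ceiling, not an
    # instance count.
    aws service-quotas list-service-quotas \
        --service-code ec2 \
        --region eu-central-1 \
        --query "Quotas[?contains(QuotaName, 'G and VT')].{Name:QuotaName,Code:QuotaCode,vCPUs:Value}" \
        --output table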

Nico
Asked 4 months ago · 219 views
1 Answer

There are a few potential factors that could be causing the occasional long start times for AWS Batch jobs:

  • Instance limits - If your account's On-Demand quota for g4dn instances is exhausted, or there is not enough g4dn.2xlarge capacity in the region, jobs wait in RUNNABLE until new instances can be launched. You can check CloudWatch usage metrics and Service Quotas for instance usage and limits.

  • Scheduling - If multiple jobs are queued at once, it can take time for the scheduler to place them all on instances. The scheduling policy may prioritize smaller jobs.

  • Image caching - container images are cached on hosts. If an image is not already cached, it has to be pulled before the job container starts, adding delay.

  • Host initialization - Sometimes when hosts start, they need time to initialize resources (security groups, VPC settings, etc.) before accepting jobs.

Some things you could try:

  • Request an On-Demand instance limit (vCPU quota) increase for that instance family (a CLI sketch follows this list).

  • Place the compute environment earlier in the job queue's compute environment order (or use a higher-priority job queue) so its jobs get access to instances sooner.

  • Switch to Spot Instances to increase available capacity, or use a mix of Spot + On-Demand (a sketch of that setup is at the end of this answer).

  • Try spreading job submissions over time to reduce scheduling bursts.

  • Check Batch metrics in CloudWatch for queue depth, scheduling latency, etc.

  • Ensure the latest AWS Batch/ECS agent is installed on your AMIs.
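
To act on the quota suggestion above, here is a sketch of a quota-increase request with the AWS CLI; the quota code L-DB2E81BA ("Running On-Demand G and VT instances") and the target of 128 vCPUs are assumptions, so confirm the code in the Service Quotas console for your account:

    # Request a higher On-Demand G/VT vCPU quota in eu-central-1.
    # g4dn instances are limited by vCPUs, so 128 vCPUs would allow up to
    # sixteen g4dn.2xlarge instances (8 vCPUs each).
    aws service-quotas request-service-quota-increase \
        --service-code ec2 \
        --quota-code L-DB2E81BA \
        --desired-value 128 \
        --region eu-central-1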

The outliers are likely due to bursting above capacity and waits for new hosts. Increasing instance limits and supplementing with Spot should help smooth things out.
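
As a sketch of the Spot + On-Demand mix (compute environment, queue, subnet, security group, and role names are placeholders, and the existing On-Demand environment is assumed to be called gpu-ondemand-ce):

    # Create a managed Spot compute environment for the same instance type.
    aws batch create-compute-environment \
        --compute-environment-name gpu-spot-ce \
        --type MANAGED \
        --state ENABLED \
        --compute-resources '{
            "type": "SPOT",
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,
            "maxvCpus": 256,
            "instanceTypes": ["g4dn.2xlarge"],
            "subnets": ["subnet-xxxxxxxx"],
            "securityGroupIds": ["sg-xxxxxxxx"],
            "instanceRole": "ecsInstanceRole"
        }'

    # Attach both environments to one job queue: On-Demand is tried first,
    # Spot picks up the overflow when On-Demand capacity or quota runs out.
    aws batch create-job-queue \
        --job-queue-name gpu-queue \
        --state ENABLED \
        --priority 1 \
        --compute-environment-order order=1,computeEnvironment=gpu-ondemand-ce order=2,computeEnvironment=gpu-spot-ce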

AWS
Saad
Answered 4 months ago
