Start-up times for AWS Batch jobs with big outliers


We use AWS Batch to start jobs automatically on g4dn.2xlarge EC2 instances (On-Demand, due to our business requirements). Usually the start-up time (from state RUNNABLE to RUNNING) is around 6 minutes, but sometimes there are outliers where we have to wait more than 1.5 hours for a job to start.

What is causing this delay?

Can we do something about it?

  • The question AWS Batch - jobs in RUNNABLE for several hours seems to be at least related, but it focuses on Spot Instances and has unfortunately gone unanswered for a year :-(

  • My first guess is insufficient capacity, i.e. no EC2 instances of that type available in the region. We started multiple jobs in parallel and could not get more than 8 instances running at the same time in eu-central-1. If that is really the current maximum, I guess we will have to move to another cloud provider. (A quota check is sketched below.)
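
Eight g4dn.2xlarge instances correspond to 64 vCPUs, so a 64-vCPU value on the "Running On-Demand G and VT instances" quota would explain the cap. A minimal boto3 sketch to check this, assuming the quota code L-DB2E81BA is the right one for that quota (verify in the Service Quotas console):

```python
# Minimal sketch (boto3): check the On-Demand vCPU quota that caps G-family
# instances. The quota code below is an assumption -- verify it in the
# Service Quotas console for your account.
import boto3

REGION = "eu-central-1"
G_QUOTA_CODE = "L-DB2E81BA"  # assumed code for "Running On-Demand G and VT instances"

quotas = boto3.client("service-quotas", region_name=REGION)
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode=G_QUOTA_CODE)["Quota"]

vcpu_limit = quota["Value"]              # quota is expressed in vCPUs
max_g4dn_2xlarge = int(vcpu_limit // 8)  # g4dn.2xlarge has 8 vCPUs

print(f"{quota['QuotaName']}: {vcpu_limit:.0f} vCPUs "
      f"(~{max_g4dn_2xlarge} g4dn.2xlarge instances)")
```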

Nico
asked 4 months ago · 204 views
1 Answer

There are a few potential factors that could be causing the occasional long start times for AWS Batch jobs:

  • Instance limits - If there are not enough On-Demand g4dn.2xlarge instances available to your account in the region, it takes time until new ones can be spun up. You can check your instance usage against the account limits in the Service Quotas console or via CloudWatch usage metrics.

  • Scheduling - If many jobs are queued at once, it can take time for the scheduler to place them all on instances, and a scheduling policy may prioritize other jobs first (the sketch after this list shows how long jobs have been sitting in RUNNABLE).

  • Image caching - Container images are cached on the instances Batch launches. If a job lands on a freshly started instance, the image has to be pulled first, which adds delay.

  • Instance initialization - Newly launched instances need time to boot, join the underlying ECS cluster, and apply security group and VPC settings before they can accept jobs.
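
To see whether the delay is scheduling backlog rather than instance start-up, it helps to look at how long jobs have actually been sitting in RUNNABLE. A minimal boto3 sketch, assuming a placeholder queue name "my-gpu-queue"; note that createdAt is the submission time, so this is an upper bound on the RUNNABLE wait:

```python
# Minimal sketch (boto3): list jobs currently in RUNNABLE for one queue and
# report how long they have been waiting. "my-gpu-queue" is a placeholder.
import time
import boto3

REGION = "eu-central-1"
JOB_QUEUE = "my-gpu-queue"  # placeholder -- replace with your queue name

batch = boto3.client("batch", region_name=REGION)
now_ms = time.time() * 1000

kwargs = {"jobQueue": JOB_QUEUE, "jobStatus": "RUNNABLE"}
while True:
    resp = batch.list_jobs(**kwargs)
    for job in resp["jobSummaryList"]:
        # createdAt is the submission time in epoch milliseconds, so this is
        # an upper bound on the time spent in RUNNABLE.
        waited_min = (now_ms - job["createdAt"]) / 60_000
        print(f"{job['jobId']}  {job['jobName']}  waiting ~{waited_min:.0f} min")
    token = resp.get("nextToken")
    if not token:
        break
    kwargs["nextToken"] = token
```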

Some things you could try:

  • Request an increase of the On-Demand instance/vCPU quota for that instance family.

  • Give the job queue a higher priority, or put the preferred compute environment first in the queue's compute environment order, so its jobs get placed on instances sooner.

  • Switch to Spot Instances to increase available capacity, or use a mix of Spot and On-Demand by attaching both kinds of compute environments to the same job queue (see the sketch after this list).

  • Try spreading job submissions over time to reduce scheduling bursts.

  • Monitor queue depth and time spent in RUNNABLE (via CloudWatch or by polling the Batch API) to spot backlogs early.

  • Ensure the AMI used by the compute environment runs an up-to-date ECS container agent.
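
One way to implement the Spot + On-Demand mix is to attach an On-Demand and a Spot compute environment to the same job queue; the scheduler fills them in order, so jobs overflow to Spot once the On-Demand environment is at its maxvCpus. A minimal boto3 sketch with placeholder environment names (both environments must already exist and be in the VALID state):

```python
# Minimal sketch (boto3): attach an On-Demand and a Spot compute environment to
# one job queue so jobs overflow to Spot when the On-Demand CE is at maxvCpus.
# The queue and compute environment names are placeholders.
import boto3

batch = boto3.client("batch", region_name="eu-central-1")

batch.create_job_queue(
    jobQueueName="gpu-queue-mixed",  # placeholder name
    state="ENABLED",
    priority=10,
    computeEnvironmentOrder=[
        # Tried first: the existing On-Demand g4dn environment.
        {"order": 1, "computeEnvironment": "g4dn-ondemand-ce"},
        # Fallback: a Spot environment with the same instance types.
        {"order": 2, "computeEnvironment": "g4dn-spot-ce"},
    ],
)
```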

The outliers are most likely caused by bursting above your available capacity and waiting for new instances to come up. Increasing the On-Demand limits and supplementing with Spot should help smooth things out.
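
If the account quota turns out to be the bottleneck, the increase can also be requested programmatically. A minimal boto3 sketch, again assuming the G/VT quota code L-DB2E81BA and using 128 vCPUs (16 g4dn.2xlarge) as an example target:

```python
# Minimal sketch (boto3): request a higher "Running On-Demand G and VT
# instances" vCPU quota. The quota code and target value are assumptions.
import boto3

quotas = boto3.client("service-quotas", region_name="eu-central-1")
resp = quotas.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-DB2E81BA",  # assumed code -- verify in the Service Quotas console
    DesiredValue=128.0,      # example target: 128 vCPUs = 16 x g4dn.2xlarge
)
print(resp["RequestedQuota"]["Status"])  # e.g. PENDING or CASE_OPENED
```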

AWS
Saad
answered 3 months ago
