How to reduce the latency of auto-scaling instances triggered by AWS Batch jobs

I have a managed AWS Batch compute environment on EC2 that I use to execute jobs. It is configured with 0 minimum vCPUs and the Spot price capacity optimized allocation strategy.

The goal is that whenever a job is submitted, a new instance is created, executes the job, and then terminates immediately.

The issue is that when a job is submitted, it takes a few minutes for the new instance to be created, and when the job completes, it takes another few minutes before the instance is terminated. When I look at the associated Auto Scaling group activity, I see that it takes about 1 minute before it registers "An instance was started in response to a difference between desired and actual capacity".

How can I reduce the latency so that the instance is created immediately after the job is submitted and terminated immediately after the job exits? Is there a better way to achieve this?

I have seen claims that Batch scaling occurs at fixed intervals (e.g. every 10 minutes), but I can't find any documentation on this.

Here is the compute environment JSON:

{
  "computeEnvironmentArn": "arn:aws:batch:....",
  "ecsClusterArn": "arn:aws:.....",
  "tags": {},
  "type": "MANAGED",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "ComputeEnvironment Healthy",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_PRICE_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 256,
    "desiredvCpus": 0,
    "instanceTypes": [
      "g4dn.xlarge"
    ],
    "bidPercentage": 100,
    "launchTemplate": {
      "launchTemplateName": "....",
      "version": "$Default"
    }
  },
  "containerOrchestrationType": "ECS"
}
2 Answers
Accepted Answer

I found a workaround. I created an unmanaged Batch compute environment, which creates an ECS cluster. I then created an Auto Scaling group (ASG) and set it as the capacity provider for that cluster.

For the ASG, I set the minimum capacity and desired capacity to 0. I then created an EventBridge rule that is triggered when a Batch job's status changes and invokes a Lambda function.
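A rule like this could be sketched with boto3 roughly as follows. This is a minimal sketch, not the exact rule from this answer: the rule name, target ID, queue ARN and the choice of statuses are assumptions, and the client is passed in (for example `boto3.client("events")`) so the pattern can be inspected without AWS access.

```python
import json


def batch_state_change_pattern(job_queue_arn, statuses):
    """Build an EventBridge event pattern for Batch job state changes.

    Assumption: filtering on one job queue and a chosen set of statuses;
    the original answer does not say which statuses it matched.
    """
    return {
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {
            "jobQueue": [job_queue_arn],
            "status": list(statuses),
        },
    }


def create_rule(events_client, rule_name, pattern, lambda_arn):
    """Create (or update) the rule and register the Lambda function as its target."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "scale-asg" is a hypothetical target ID chosen for this sketch.
        Targets=[{"Id": "scale-asg", "Arn": lambda_arn}],
    )
```

The Lambda function also needs a resource-based permission that allows events.amazonaws.com to invoke it, otherwise the rule matches but the function is never called.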

In the Lambda function, I increment or decrement the ASG's DesiredCapacity property. This results in an instance being created when a job is submitted and destroyed when the job completes.

To get the instances to register with ECS after booting, I updated the launch template with this user data:

#!/bin/bash
echo ECS_CLUSTER="name_of_the_ecs_cluster" >> /etc/ecs/ecs.config
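If the launch template is updated through the API rather than the console, the user data has to be base64-encoded first. A minimal sketch, assuming a boto3 EC2 client is passed in and that `template_name` and `source_version` are placeholders you supply:

```python
import base64


def render_user_data(cluster_name):
    """Return the user data script that points the ECS agent at the cluster."""
    return (
        "#!/bin/bash\n"
        f'echo ECS_CLUSTER="{cluster_name}" >> /etc/ecs/ecs.config\n'
    )


def encode_user_data(script):
    """Launch template data expects user data as base64-encoded text."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


def add_template_version(ec2_client, template_name, source_version, cluster_name):
    """Create a new launch template version that carries the user data.

    template_name and source_version are hypothetical inputs; the new
    version is based on source_version with only UserData overridden.
    """
    return ec2_client.create_launch_template_version(
        LaunchTemplateName=template_name,
        SourceVersion=source_version,
        LaunchTemplateData={
            "UserData": encode_user_data(render_user_data(cluster_name)),
        },
    )
```

The ASG only picks up the new version if it references it (or `$Latest`/`$Default` pointing at it), so that is worth checking after the update.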

To ensure that scaling in terminates the correct instance (the one whose job has finished), I enabled "instance scale-in protection" for all new instances. Then, in the Lambda function, before decrementing DesiredCapacity, I disable the protection for the container instance associated with the job.
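Finding the instance behind a finished job and lifting its protection could be sketched as below. This is an assumption-laden outline rather than the actual function: it assumes the job ran as a single container on ECS, and the clients (`boto3.client("batch")`, `boto3.client("ecs")`, `boto3.client("autoscaling")`), cluster name and ASG name are passed in by the caller.

```python
def ec2_instance_for_job(batch_client, ecs_client, cluster, job_id):
    """Resolve the EC2 instance ID that ran a given Batch job."""
    job = batch_client.describe_jobs(jobs=[job_id])["jobs"][0]
    # The job's container section records which ECS container instance ran it.
    container_instance_arn = job["container"]["containerInstanceArn"]
    described = ecs_client.describe_container_instances(
        cluster=cluster,
        containerInstances=[container_instance_arn],
    )
    return described["containerInstances"][0]["ec2InstanceId"]


def remove_scale_in_protection(asg_client, asg_name, instance_id):
    """Allow the ASG to pick this instance when desired capacity drops."""
    asg_client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=False,
    )
```

With every other instance still protected, the unprotected one is the only candidate left when DesiredCapacity is decremented. A different technique that may be worth comparing is `terminate_instance_in_auto_scaling_group` with `ShouldDecrementDesiredCapacity=True`, which terminates a named instance and lowers the desired capacity in one call.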

Adam
answered 2 months ago

Hello,

I couldn't find straightforward documentation on reducing this latency. One recommendation from the EC2 docs is to configure a warm pool for the Auto Scaling group, so that pre-initialized instances can be brought into service faster.
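For an Auto Scaling group you manage yourself, enabling a warm pool with boto3 might look like the sketch below. The group name and pool size are placeholders, and the client is expected to be `boto3.client("autoscaling")`.

```python
def enable_warm_pool(asg_client, asg_name, min_size=1):
    """Keep a small pool of stopped, pre-initialized instances for the group."""
    asg_client.put_warm_pool(
        AutoScalingGroupName=asg_name,
        MinSize=min_size,
        # Stopped instances avoid compute charges while they wait in the pool.
        PoolState="Stopped",
    )
```

As far as I know, warm pools cannot be added to Auto Scaling groups that launch Spot Instances or use a mixed instances policy, so this would fit an On-Demand group rather than the Spot environment from the question. It is worth checking the current Auto Scaling documentation before relying on it.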

Hope this helps!

[1] https://ec2spotworkshops.com/efficient-and-resilient-ec2-auto-scaling/lab1/90-enable-warm-pool.html
[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/examples-warm-pools-aws-cli.html

AWS
sai
answered 2 months ago
