How to reduce latency of auto-scaling instances triggered by AWS Batch jobs


I have a managed AWS Batch EC2 compute environment that I use to execute jobs. It is configured with 0 minimum vCPUs and the Spot price capacity optimized allocation strategy.

The goal is that whenever a job is submitted, a new instance is created, executes the job, and then terminates immediately.

The issue is that when a job is submitted, it takes a few minutes for the new instance to be created, and when the job completes, it takes another few minutes before the instance is terminated. In the associated Auto Scaling group's activity history, about 1 minute passes before it registers "An instance was started in response to a difference between desired and actual capacity".

How can I reduce the latency so that the instance is created immediately after the job is submitted and terminated immediately after the job exits? Is there a better way to achieve this?

I have seen claims that Batch scaling occurs at fixed intervals (e.g. every 10 minutes), but I can't find any documentation on this.

Here is the compute environment JSON:

{
  "computeEnvironmentArn": "arn:aws:batch:....",
  "ecsClusterArn": "arn:aws:.....",
  "tags": {},
  "type": "MANAGED",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "ComputeEnvironment Healthy",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_PRICE_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 256,
    "desiredvCpus": 0,
    "instanceTypes": [
      "g4dn.xlarge"
    ],
    "bidPercentage": 100,
    "launchTemplate": {
      "launchTemplateName": "....",
      "version": "$Default"
    }
  },
  "containerOrchestrationType": "ECS"
}
Adam
Asked 2 months ago, viewed 432 times
2 Answers
Accepted Answer

I found a workaround. I created an unmanaged Batch compute environment, which creates an ECS cluster. I then created an Auto Scaling group (ASG) and set it as the capacity provider for that cluster.

For the ASG, I set the minimum capacity and desired capacity to 0. I then created an EventBridge rule that is triggered when a Batch job's status changes and invokes a Lambda function.
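The rule described above could be set up roughly as follows. This is a minimal sketch, not the exact rule from this answer: `aws.batch` and `Batch Job State Change` are the source and detail-type that Batch emits to EventBridge, while the function names, the job queue filter, the chosen statuses, and the target ID are assumptions for illustration. The client is passed in (for example `boto3.client("events")`) so the pattern can be inspected without AWS access.

```python
import json


def build_event_pattern(job_queue_arn, statuses=("SUBMITTED", "SUCCEEDED", "FAILED")):
    """Return an EventBridge pattern matching Batch job state changes.

    Filtering on jobQueue and on these three statuses is an assumption;
    adjust both to the states that should trigger scaling.
    """
    return {
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {
            "jobQueue": [job_queue_arn],
            "status": list(statuses),
        },
    }


def create_rule(events_client, rule_name, lambda_arn, job_queue_arn):
    """Create the rule and point it at the Lambda function (hypothetical names)."""
    pattern = build_event_pattern(job_queue_arn)
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "scale-asg", "Arn": lambda_arn}],
    )
    return pattern
```

The Lambda function also needs a resource-based permission that allows EventBridge to invoke it, which this sketch does not add.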

In the Lambda function, I increment or decrement the ASG's DesiredCapacity property. This results in an instance being created when a job is submitted and destroyed when the job completes.
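A handler along these lines might look like the sketch below. It is an illustration under assumptions, not the code used in this answer: the `ASG_NAME` environment variable, the helper names, and the choice of `SUBMITTED` as the scale-up state and `SUCCEEDED`/`FAILED` as the scale-down states are all hypothetical. The `autoscaling` parameter exists only so the logic can be exercised with a stand-in client; in Lambda it defaults to a real boto3 client.

```python
import os

# Assumed mapping from Batch job status to a change in desired capacity.
SCALE_UP_STATES = {"SUBMITTED"}
SCALE_DOWN_STATES = {"SUCCEEDED", "FAILED"}


def capacity_delta(status):
    """Return +1, -1, or 0 for a Batch job status."""
    if status in SCALE_UP_STATES:
        return 1
    if status in SCALE_DOWN_STATES:
        return -1
    return 0


def adjust_desired_capacity(autoscaling, asg_name, delta):
    """Read the current desired capacity, apply delta, clamp to min/max."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]
    current = group["DesiredCapacity"]
    target = max(group["MinSize"], min(group["MaxSize"], current + delta))
    if target != current:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=target,
            HonorCooldown=False,  # skip cooldown so the change applies at once
        )
    return target


def handler(event, context, autoscaling=None):
    """Lambda entry point for 'Batch Job State Change' events."""
    if autoscaling is None:
        import boto3  # imported lazily; available in the Lambda runtime
        autoscaling = boto3.client("autoscaling")
    delta = capacity_delta(event["detail"]["status"])
    if delta == 0:
        return None
    return adjust_desired_capacity(autoscaling, os.environ["ASG_NAME"], delta)
```

Note that the read-then-write on DesiredCapacity is not atomic, so two events handled at the same moment could overwrite each other; limiting the function's concurrency is one way to reduce that risk.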

To get the instances to register with ECS after booting, I updated the launch template with this user data:

#!/bin/bash
echo ECS_CLUSTER="name_of_the_ecs_cluster" >> /etc/ecs/ecs.config

To ensure that scaling in terminates the correct instance (the one whose job has finished), I enabled "instance scale-in protection" for all new instances. Then, in the Lambda function, before decrementing DesiredCapacity, I disable the instance protection for the container instance associated with the job.
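The protection step could be sketched as follows. This assumes the job event's `detail` carries `container.containerInstanceArn` (as the Batch job description does) and that each instance runs a single job; the function names and parameters are hypothetical. Both clients are passed in, for example `boto3.client("ecs")` and `boto3.client("autoscaling")`.

```python
def ec2_instance_for_job(ecs, cluster, job_detail):
    """Resolve the EC2 instance ID that ran the job, or None if unknown."""
    arn = job_detail.get("container", {}).get("containerInstanceArn")
    if not arn:
        # Jobs that never reached an instance have no container instance ARN.
        return None
    resp = ecs.describe_container_instances(
        cluster=cluster, containerInstances=[arn]
    )
    instances = resp.get("containerInstances", [])
    return instances[0]["ec2InstanceId"] if instances else None


def release_instance(autoscaling, ecs, asg_name, cluster, job_detail):
    """Remove scale-in protection from the finished job's instance."""
    instance_id = ec2_instance_for_job(ecs, cluster, job_detail)
    if instance_id is None:
        return None
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=False,
    )
    return instance_id
```

If several jobs can share one instance, removing protection on the first completion could terminate an instance that is still running other jobs, so this sketch only fits the one-job-per-instance setup described above.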

Adam
Answered 2 months ago

Hello,

I couldn't find a straightforward document on reducing this latency. One recommendation from the EC2 Auto Scaling documentation is to configure a warm pool, so that pre-initialized instances can be brought into service faster.
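As a rough illustration of that suggestion, a warm pool can be attached to an Auto Scaling group with the `PutWarmPool` API. The sketch below assumes a self-managed ASG whose name is known; the function name and default sizes are hypothetical. As far as I understand the EC2 Auto Scaling documentation, warm pools are not available for groups that launch Spot Instances or use a mixed instances policy, so this would likely not apply to the Spot compute environment in the question as configured.

```python
def enable_warm_pool(autoscaling, asg_name, min_size=1, pool_state="Stopped"):
    """Attach a warm pool of pre-initialized instances to an ASG.

    autoscaling is an Auto Scaling client, e.g. boto3.client("autoscaling").
    min_size and pool_state are illustrative defaults, not recommendations.
    """
    params = {
        "AutoScalingGroupName": asg_name,
        "MinSize": min_size,
        "PoolState": pool_state,  # "Stopped", "Running", or "Hibernated"
    }
    autoscaling.put_warm_pool(**params)
    return params
```

Stopped instances in a warm pool still incur storage costs for their volumes, which is the trade-off for the faster start.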

Hope this helps!

[1] https://ec2spotworkshops.com/efficient-and-resilient-ec2-auto-scaling/lab1/90-enable-warm-pool.html
[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/examples-warm-pools-aws-cli.html

AWS
sai
Answered 2 months ago
