Configuring auto-scaling for SageMaker async inference


I've configured a model for async inference, and it's working correctly - I can submit a file via invoke_endpoint_async and download the output from S3.

I'm now trying to configure auto-scaling. I'm experimenting with different options, but basically I want to configure 0-1 instances, have an instance created when invoke_endpoint_async is called, and have the instance shut down shortly afterwards (along the lines of batch inference).

I'm struggling to get it to work - I'm experiencing similar issues to

First, I think there's an issue with the console - if I run aws application-autoscaling register-scalable-target ... it works, but the console doesn't accept zero for --min-capacity.
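For reference, a minimal sketch of registering the scalable target with a minimum of zero via boto3 (the endpoint and variant names here are hypothetical placeholders - substitute your own):

```python
# Hypothetical resource ID; substitute your own endpoint and variant names.
resource_id = "endpoint/my-async-endpoint/variant/AllTraffic"

params = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 0,  # zero is accepted via the API even though the console rejects it
    "MaxCapacity": 1,
}

# With boto3 (requires the boto3 package and AWS credentials):
#   boto3.client("application-autoscaling").register_scalable_target(**params)
print(params)
```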


I think this is just a UI nit, though. The bigger problem is that I don't understand how the policy works - I have:

    {
        "TargetValue": 1.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "***-test-endpoint-2023-03-24-04-28-06-341"}],
            "Statistic": "Average"
        },
        "ScaleInCooldown": 60,
        "ScaleOutCooldown": 60
    }
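For completeness, the full put-scaling-policy request around that configuration would look roughly like this (a sketch; the endpoint and policy names are hypothetical, and the policy type must be TargetTrackingScaling):

```python
import json

# Hypothetical endpoint name; substitute your own.
policy_config = {
    "TargetValue": 1.0,
    "CustomizedMetricSpecification": {
        "MetricName": "ApproximateBacklogSizePerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
        "Statistic": "Average",
    },
    "ScaleInCooldown": 60,
    "ScaleOutCooldown": 60,
}

# With boto3 (requires credentials):
#   boto3.client("application-autoscaling").put_scaling_policy(
#       PolicyName="my-backlog-policy",
#       ServiceNamespace="sagemaker",
#       ResourceId="endpoint/my-async-endpoint/variant/AllTraffic",
#       ScalableDimension="sagemaker:variant:DesiredInstanceCount",
#       PolicyType="TargetTrackingScaling",
#       TargetTrackingScalingPolicyConfiguration=policy_config,
#   )
print(json.dumps(policy_config, indent=2))
```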

The first point of confusion was that the console shows both a built-in and a custom policy. I was initially using the name of the built-in policy (SageMakerEndpointInvocationScalingPolicy), but put-scaling-policy doesn't appear to edit it - it creates a new policy with the same name.

When I monitor the scaling activity with

aws application-autoscaling describe-scaling-activities \
    --service-namespace sagemaker

I can initially see "Successfully set desired instance count to 0. Change successfully fulfilled by sagemaker."

But when I invoke the endpoint with

    response = sm_runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,
    )
    output_location = response['OutputLocation']

I would expect to see the instance count increase to 1, then go back to zero within a few minutes. I have occasionally got it to scale, but not reliably. I think the main issue is that I don't understand the metric and how it interacts with the target.

I've seen charts, but I cannot figure out how to plot ApproximateBacklogSizePerInstance. How does it interact with TargetValue, and what is the actual trigger for a scale in/out?
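For what it's worth, here is a sketch of the CloudWatch query I'd use to chart the metric (the endpoint name is hypothetical; ApproximateBacklogSizePerInstance is published under the AWS/SageMaker namespace with an EndpointName dimension):

```python
from datetime import datetime, timedelta, timezone

# Request parameters for CloudWatch GetMetricStatistics.
# The endpoint name is a hypothetical placeholder.
end = datetime.now(timezone.utc)
request = {
    "Namespace": "AWS/SageMaker",
    "MetricName": "ApproximateBacklogSizePerInstance",
    "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 60,          # one datapoint per minute
    "Statistics": ["Average"],
}

# With boto3 (requires credentials):
#   points = boto3.client("cloudwatch").get_metric_statistics(**request)["Datapoints"]
print(request["MetricName"])
```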

2 Answers
Accepted Answer

A target tracking scaling policy will create 2 CloudWatch alarms (one for high and one for low usage), which you'll be able to see in the CloudWatch alarms console. The high alarm needs 3 consecutive 60-second breaching datapoints to trigger a scale-out, and the low alarm needs 15 consecutive 60-second breaching datapoints to scale in.
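In other words, the trigger is the metric's per-minute datapoints crossing the TargetValue for enough consecutive periods. A rough sketch of the decision rule (the real evaluation is done by the CloudWatch alarms; this only illustrates the consecutive-breach logic):

```python
def breaches(datapoints, target, needed, above):
    """True if the last `needed` datapoints all breach the target."""
    if len(datapoints) < needed:
        return False
    recent = datapoints[-needed:]
    return all((d > target) if above else (d < target) for d in recent)

# Per-minute ApproximateBacklogSizePerInstance samples (illustrative values).
samples = [0, 0, 2, 3, 4]

scale_out = breaches(samples, target=1.0, needed=3, above=True)   # high alarm
scale_in = breaches(samples, target=1.0, needed=15, above=False)  # low alarm
print(scale_out, scale_in)  # True False
```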

You may instead want to use step scaling policies, where you can create and control the alarms as well as the policy settings.
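As a sketch of that alternative (all names hypothetical): with StepScaling you define the capacity adjustment in the policy, then create your own CloudWatch alarm, with whatever evaluation period you like, that invokes the policy ARN.

```python
# Step scaling sketch: the policy defines the adjustment; a separate
# CloudWatch alarm that you create and control invokes it.
step_policy = {
    "AdjustmentType": "ExactCapacity",
    "Cooldown": 60,
    "MetricAggregationType": "Maximum",
    "StepAdjustments": [
        # metric at or above the alarm threshold -> set capacity to 1
        {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1},
    ],
}

# With boto3 (requires credentials):
#   arn = boto3.client("application-autoscaling").put_scaling_policy(
#       PolicyName="my-step-policy",
#       ServiceNamespace="sagemaker",
#       ResourceId="endpoint/my-async-endpoint/variant/AllTraffic",
#       ScalableDimension="sagemaker:variant:DesiredInstanceCount",
#       PolicyType="StepScaling",
#       StepScalingPolicyConfiguration=step_policy,
#   )["PolicyARN"]
#   # then create a CloudWatch alarm with AlarmActions=[arn]
print(step_policy["AdjustmentType"])
```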

answered a year ago

Thanks - once I learned that the policy is managed by CloudWatch alarms, I was able to observe my policy working. In particular, I changed the metric to HasBacklogWithoutCapacity, the target to 0.5, and the statistic to Maximum instead of Average, and it behaves as I require. I did notice there's a delay of about 2-3 minutes between when I submit an inference job to the queue and when the metric increases from 0 to 1, and then it waits for the 3 consecutive values, so overall it takes about 5 minutes rather than 3 to start adding capacity. I'll try step scaling to reduce that, but at least I have a working proof-of-concept to compare with batch inference now.
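For anyone else landing here, the configuration that worked for me looks roughly like this (the endpoint name is a hypothetical placeholder):

```python
import json

# Working scale-from-zero configuration (endpoint name hypothetical).
# HasBacklogWithoutCapacity is 1 whenever there are queued requests but
# no running instances, so Maximum with a 0.5 target scales out on the
# first queued request.
policy_config = {
    "TargetValue": 0.5,
    "CustomizedMetricSpecification": {
        "MetricName": "HasBacklogWithoutCapacity",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
        "Statistic": "Maximum",
    },
    "ScaleInCooldown": 60,
    "ScaleOutCooldown": 60,
}
print(json.dumps(policy_config, indent=2))
```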

answered a year ago
  • Glad you were able to get it at least mostly working! Just as an FYI, the target tracking scaling policy expects the alarm to be configured exactly as Auto Scaling created it, so modifying the alarm can lead to unexpected behavior (most often, the alarm triggering but scaling not happening). So any time you need the alarm customized, step scaling is the way to go, unless it's something that can be done natively in the CustomizedMetricSpecification section of PutScalingPolicy.
