I am setting up autoscaling for a real-time inference endpoint in SageMaker. I set up a load test using Locust, and with relatively high numbers (e.g. 100 users, with 10 users spawned per second) I can see the InvocationsPerInstance metric in CloudWatch ramp up quickly to 20 000. In CloudWatch I set the statistic for InvocationsPerInstance to 'Sum' with a period of '1 minute'.
Then, I created an autoscaling policy for my endpoint, with the following settings:
- SageMakerVariantInvocationsPerInstance → target value: 1500
- Scale in cool down seconds → 100
- Scale out cool down seconds → 10
With this, I would expect scale out to be triggered as soon as the sum of InvocationsPerInstance exceeds 1500. It does work, but with a significant delay: the metric stays above 20 000 for more than 5 minutes before the scale out happens. Scale in is delayed even more: when I stop the test and InvocationsPerInstance drops to 0, the scale in happens only after more than 25 minutes.
See the graph below to see the delay in the scale out:
Why is it so delayed? Is this expected behaviour? Am I perhaps doing something wrong in the way I calculate the metric?
Thank you so much! Really appreciate your help and guidance!
EDIT: I checked the CloudWatch alarms, and I can see that:
- For scale out the threshold is → InvocationsPerInstance > 1500 for 3 datapoints within 3 minutes
- For scale in the threshold is → InvocationsPerInstance < 1350 for 15 datapoints within 15 minutes
So this appears to be the issue. Is there a way to change these minutes?
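The minimum delays follow directly from those alarm settings. The sketch below is a simplified model, not AWS code: it assumes one datapoint per minute and that an alarm fires only once the most recent N datapoints all breach the threshold. The helper name `minutes_until_alarm` is hypothetical and exists only for this illustration.

```python
from typing import Callable, Optional, Sequence


def minutes_until_alarm(
    datapoints: Sequence[float],
    breaches: Callable[[float], bool],
    datapoints_to_alarm: int,
) -> Optional[int]:
    """Return the 1-based minute at which the alarm would fire, or None.

    Simplified model (assumption): one datapoint per minute; the alarm fires
    as soon as the most recent `datapoints_to_alarm` datapoints all breach.
    """
    consecutive = 0
    for minute, value in enumerate(datapoints, start=1):
        # Count consecutive breaching datapoints; reset on a non-breaching one.
        consecutive = consecutive + 1 if breaches(value) else 0
        if consecutive >= datapoints_to_alarm:
            return minute
    return None


# Load test running: the metric sits at 20 000 invocations per minute.
ramp_up = [20_000] * 10
scale_out_minute = minutes_until_alarm(ramp_up, lambda v: v > 1500, 3)

# Load test stopped: the metric drops to 0 and stays there.
idle = [0] * 30
scale_in_minute = minutes_until_alarm(idle, lambda v: v < 1350, 15)

print(scale_out_minute, scale_in_minute)  # prints: 3 15
```

Under this model the scale-out alarm cannot fire before minute 3 and the scale-in alarm cannot fire before minute 15, however short the cool-down values are. Any delay in metric delivery and the time needed to add or remove an instance come on top of that.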
This is the way I add my policy:
import time

import boto3

# Application Auto Scaling client (assumed; its creation is not shown in the original snippet)
aas_client = boto3.client("application-autoscaling")


def set_target_scaling_on_invocation(
    endpoint_name: str,
    variant_name: str,
    target_value: int,
    scale_out_cool_down: int = 10,
    scale_in_cool_down: int = 100,
) -> tuple[str, dict]:
    """
    Set scaling target based on invocation per instance with cool-down periods

    Parameters
    ----------
    endpoint_name : str
        The name of the endpoint
    variant_name : str
        The name of the endpoint variant
    target_value : int
        The target value for scaling based on invocations per instance
    scale_out_cool_down : int, optional
        The cool-down period for scaling out in seconds, by default 10
    scale_in_cool_down : int, optional
        The cool-down period for scaling in in seconds, by default 100

    Returns
    -------
    tuple[str, dict]
        The policy name and the response from the scaling policy creation
    """
    policy_name = f"target-tracking-invocations-{round(time.time())}"
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    response = aas_client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_value,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": scale_out_cool_down,
            "ScaleInCooldown": scale_in_cool_down,
            "DisableScaleIn": False,
        },
    )
    return policy_name, response
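On the question of changing the evaluation window: the alarms created by a target tracking policy are managed by Application Auto Scaling, so one possible alternative (an assumption here, not something the post confirms) is a step scaling policy driven by a self-defined CloudWatch alarm with a shorter window. The sketch below only builds the keyword arguments for `put_scaling_policy` and `put_metric_alarm`; the builder functions, policy and alarm names, the single `+1` step, and the default values are all hypothetical choices for illustration.

```python
from typing import Any, Dict


def build_step_scaling_kwargs(
    endpoint_name: str, variant_name: str, cooldown: int = 10
) -> Dict[str, Any]:
    """Keyword arguments for Application Auto Scaling `put_scaling_policy`."""
    return {
        "PolicyName": f"step-scale-out-{endpoint_name}-{variant_name}",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            # Illustrative single step: add one instance on any breach.
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1}
            ],
            "Cooldown": cooldown,
            "MetricAggregationType": "Average",
        },
    }


def build_scale_out_alarm_kwargs(
    endpoint_name: str,
    variant_name: str,
    policy_arn: str,
    threshold: float = 1500,
    evaluation_periods: int = 1,
) -> Dict[str, Any]:
    """Keyword arguments for CloudWatch `put_metric_alarm`."""
    return {
        "AlarmName": f"scale-out-{endpoint_name}-{variant_name}",
        "Namespace": "AWS/SageMaker",
        "MetricName": "InvocationsPerInstance",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 60,  # seconds per datapoint
        # Shorter window than the 3 datapoints used by the managed alarm.
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [policy_arn],
    }


# Intended use (requires AWS credentials, so left as comments):
#   policy = aas_client.put_scaling_policy(**build_step_scaling_kwargs("my-endpoint", "AllTraffic"))
#   boto3.client("cloudwatch").put_metric_alarm(
#       **build_scale_out_alarm_kwargs("my-endpoint", "AllTraffic", policy["PolicyARN"])
#   )
policy_kwargs = build_step_scaling_kwargs("my-endpoint", "AllTraffic")
alarm_kwargs = build_scale_out_alarm_kwargs("my-endpoint", "AllTraffic", "arn:example")
```

A one-datapoint alarm reacts faster but is also more sensitive to short spikes, so the window length is a trade-off rather than a value to minimise blindly.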
Thank you so much! It's clear now, I appreciate it a lot!