Best Practices for Fast Scaling SageMaker Async Endpoint from 0 to 1 Instance for Single Requests

Hello everyone,

I'm currently working on an AWS SageMaker project where I need to scale an async endpoint from 0 to 1 instance to handle a single request. The goal is to maintain cost efficiency by ensuring instances don't run when there are no incoming requests. However, I'm facing issues with the timing of scaling up when a request arrives, leading to latency problems.

Current Setup:

  • Using SageMaker async endpoints
  • Auto-scaling policies configured based on the HasBacklogWithoutCapacity metric
  • Experiencing latency when scaling from 0 to 1 instance

Here is the code snippet I am using to configure the auto-scaling policy:

scaling_policy = application_scaling.put_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1.0,
        'CustomizedMetricSpecification': {
            'MetricName': 'HasBacklogWithoutCapacity',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name}
            ],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 60,
        'ScaleOutCooldown': 10,
    }
)

I also tried StepScaling. Here's the code snippet:

scaleout_policy_name = f"{endpoint_name}-ScalingoutPolicy"

scalingout_policy = application_scaling.put_scaling_policy(
    PolicyName=scaleout_policy_name,
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        'AdjustmentType': 'ChangeInCapacity',
        'StepAdjustments': [
            {
                'MetricIntervalLowerBound': 0,
                # 'MetricIntervalUpperBound': 10.0,
                'ScalingAdjustment': 1  
            },
        ],
        'Cooldown': 30, 
        'MetricAggregationType': 'Average'
    }
)

scaleout_alarm_name = f"{endpoint_name}-ScalingoutAlarm"
scale_out_policy_arn = scalingout_policy['PolicyARN']

scalingout_alarm = cloudwatch_client.put_metric_alarm(
    AlarmName=scaleout_alarm_name,
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    Period=60,  # Period in seconds
    EvaluationPeriods=1,
    Threshold=1.0, 
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    Dimensions=[
        {
            'Name': 'EndpointName',
            'Value': endpoint_name
        },
        {
            'Name': 'VariantName',
            'Value': 'AllTraffic'
        }
    ],
    AlarmActions=[
        scale_out_policy_arn
    ],
    AlarmDescription='Alarm for scaling out SageMaker endpoint instances',
    Unit='Count'
)

Can anyone suggest a better way to reduce latency when an async SageMaker endpoint scales from 0 instances to 1 for a single request? Is it possible to speed up the scaling process with specific configurations or optimizations? Any help would be greatly appreciated.

1 Answer

What kind of latency are you looking for? The alarm behind your step scaling policy evaluates a single one-minute period, which is essentially as fast as you get with CloudWatch alarms (unless you're pushing custom metrics).
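To illustrate the custom-metric route, here is a minimal sketch, assuming the application that submits requests can also publish a metric. The namespace `Custom/AsyncEndpoint`, the metric name `PendingRequest`, the alarm name, and both function names are hypothetical; the clients are passed in so the functions work with `boto3.client("cloudwatch")`. High-resolution custom metrics allow a 10-second alarm period, which standard AWS/SageMaker metrics do not.

```python
# Hypothetical names: NAMESPACE, METRIC_NAME, and both functions are
# illustrative assumptions, not part of the original question.
NAMESPACE = "Custom/AsyncEndpoint"
METRIC_NAME = "PendingRequest"


def publish_request_signal(cloudwatch_client, endpoint_name):
    """Publish a high-resolution data point each time a request is submitted."""
    cloudwatch_client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[
            {
                "MetricName": METRIC_NAME,
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Value": 1.0,
                "Unit": "Count",
                "StorageResolution": 1,  # 1 = high-resolution metric
            }
        ],
    )


def create_fast_scale_out_alarm(cloudwatch_client, endpoint_name, scale_out_policy_arn):
    """Create an alarm on the custom metric with a 10-second period."""
    cloudwatch_client.put_metric_alarm(
        AlarmName=f"{endpoint_name}-FastScaleOutAlarm",
        Namespace=NAMESPACE,
        MetricName=METRIC_NAME,
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
        Statistic="Maximum",
        Period=10,  # only valid for high-resolution metrics
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[scale_out_policy_arn],  # ARN of the step scaling policy
    )
```

This only shortens how quickly the scale-out is triggered. The time needed to launch the instance and load the model is unaffected.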

Are you in control of the application sending the request? If so, could you just manually set the capacity to 1 at the same time as the request is submitted?

AWS
EXPERT
answered a year ago
  • As soon as a request arrives, I want to scale my SageMaker async endpoint from 0 to 1, and yes, I control the application that sends the request. Could you share a code snippet for this? That would be helpful.

  • I don't have one, but it looks like you would call update-endpoint-weights-and-capacities to set the DesiredInstanceCount. If you expect it to always be 0 or 1, then just always set it to 1 when submitting a job. If it can go above 1, the logic gets more complicated and you're largely re-creating auto scaling. At that point, you should probably look at the broader architecture and business requirements to see whether the 1-minute delay from step scaling is actually an issue, or whether it can be worked around.

  • Thank you for your response and the helpful insights. Yes, I expect the instance count to stay at 0 and change to 1 only when requests arrive at the async endpoint. Unfortunately, from a business requirement perspective, the 1-minute delay is indeed problematic for our use case. I appreciate your suggestion about using update-endpoint-weights-and-capacities to set the DesiredInstanceCount. Could you provide a code snippet or some more detailed guidance on how to implement this manually? This would be incredibly helpful. Thank you again for your assistance!
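
For reference, the manual approach discussed in these comments could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe: the function name `submit_with_wakeup` and its parameters are hypothetical, the default variant name `AllTraffic` is taken from the alarm in the question, and it assumes the endpoint accepts a direct `DesiredInstanceCount` change while it is registered with Application Auto Scaling (worth verifying, since nothing in this thread confirms it). The boto3 calls used are `describe_endpoint` and `update_endpoint_weights_and_capacities` on the `sagemaker` client and `invoke_endpoint_async` on the `sagemaker-runtime` client.

```python
def submit_with_wakeup(sm_client, runtime_client, endpoint_name,
                       input_location, variant_name="AllTraffic"):
    """Request one instance if none is desired, then queue the async request.

    sm_client is expected to behave like boto3.client("sagemaker") and
    runtime_client like boto3.client("sagemaker-runtime").
    Returns the S3 output location reported by the async invocation.
    """
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    variant = next(
        v for v in description["ProductionVariants"]
        if v["VariantName"] == variant_name
    )

    # Only touch capacity when the variant is currently set to zero instances.
    if variant.get("DesiredInstanceCount", 0) < 1:
        sm_client.update_endpoint_weights_and_capacities(
            EndpointName=endpoint_name,
            DesiredWeightsAndCapacities=[
                {"VariantName": variant_name, "DesiredInstanceCount": 1}
            ],
        )

    # The async request is queued even while no instance is running yet.
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_location,
    )
    return response["OutputLocation"]


# Usage sketch (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   output = submit_with_wakeup(
#       boto3.client("sagemaker"),
#       boto3.client("sagemaker-runtime"),
#       "my-endpoint",
#       "s3://my-bucket/input.json",
#   )
```

This removes the wait for the CloudWatch alarm, but the instance still has to launch and load the model before the queued request is processed. Scaling back in to 0 would remain the job of the existing auto-scaling policy.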
