Hello everyone,
I'm currently working on an AWS SageMaker project where I need to scale an async endpoint from 0 to 1 instance to handle a single request. The goal is to maintain cost efficiency by ensuring instances don't run when there are no incoming requests. However, I'm facing issues with the timing of scaling up when a request arrives, leading to latency problems.
Current setup:
- Using SageMaker async endpoints
- Auto-scaling policies configured based on the HasBacklogWithoutCapacity metric
- Experiencing latency when scaling from 0 to 1 instance
Here is the code snippet I am using to configure the auto-scaling policy:
    scaling_policy = application_scaling.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
        ResourceId=resource_id,  # Endpoint name
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': 1.0,
            'CustomizedMetricSpecification': {
                'MetricName': 'HasBacklogWithoutCapacity',
                'Namespace': 'AWS/SageMaker',
                'Dimensions': [
                    {'Name': 'EndpointName', 'Value': endpoint_name}
                ],
                'Statistic': 'Average',
            },
            'ScaleInCooldown': 60,
            'ScaleOutCooldown': 10,
        }
    )
I also tried StepScaling. Here's the code snippet:
    scaleout_policy_name = f"{endpoint_name}-ScalingoutPolicy"
    scalingout_policy = application_scaling.put_scaling_policy(
        PolicyName=scaleout_policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            'AdjustmentType': 'ChangeInCapacity',
            'StepAdjustments': [
                {
                    'MetricIntervalLowerBound': 0,
                    # 'MetricIntervalUpperBound': 10.0,
                    'ScalingAdjustment': 1
                },
            ],
            'Cooldown': 30,
            'MetricAggregationType': 'Average'
        }
    )

    scaleout_alarm_name = f"{endpoint_name}-ScalingoutAlarm"
    scale_out_policy_arn = scalingout_policy['PolicyARN']
    scalingout_alarm = cloudwatch_client.put_metric_alarm(
        AlarmName=scaleout_alarm_name,
        MetricName='HasBacklogWithoutCapacity',
        Namespace='AWS/SageMaker',
        Statistic='Average',
        Period=60,  # Period in seconds
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
        Dimensions=[
            {
                'Name': 'EndpointName',
                'Value': endpoint_name
            },
            {
                'Name': 'VariantName',
                'Value': 'AllTraffic'
            }
        ],
        AlarmActions=[
            scale_out_policy_arn
        ],
        AlarmDescription='Alarm for scaling out SageMaker endpoint instances',
        Unit='Count'
    )
Can anyone suggest a better way, or offer advice on reducing latency when scaling a SageMaker async endpoint from 0 instances to 1 for a single request? Is it possible to speed up the scaling process with specific configurations or optimizations? Any help would be greatly appreciated.
As soon as a request comes in, I want to scale my SageMaker async endpoint from 0 to 1, and yes, I can send the scaling request manually. Could you also share a code snippet for this? That would be helpful.
I don't have one, but it looks like you would call update-endpoint-weights-and-capacities to set the DesiredInstanceCount. If you expect it to always be 0 or 1, then just always set it to 1 when submitting a job. If it can go above 1, the logic will be more complicated and you're largely re-creating auto scaling. At that point, you should probably look at the broader architecture and business requirements to see whether the 1-minute delay from step scaling is actually an issue, or whether it can be worked around.
Thank you for your response and the helpful insights. Yes, I do expect the instance count to always be 0 and only change to 1 when requests arrive at the async endpoint. Unfortunately, from a business requirement perspective, the 1-minute delay is indeed problematic for our use case. I appreciate your suggestion about using update-endpoint-weights-and-capacities to set the DesiredInstanceCount. Could you provide a code snippet or some more detailed guidance on how to implement this manually? This would be incredibly helpful. Thank you again for your assistance!
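For what it's worth, here is a minimal sketch of the manual approach discussed above, assuming boto3. The function name `submit_with_scale_up`, its parameters, and the default variant name `AllTraffic` are made up for illustration. It relies on the `describe_endpoint`, `update_endpoint_weights_and_capacities`, and `invoke_endpoint_async` client calls. Whether a direct capacity update is accepted while an Application Auto Scaling target is registered on the variant is an assumption you would need to verify for your setup, and this does not remove the instance start-up time itself; it only skips the wait for the CloudWatch alarm.

```python
def submit_with_scale_up(sm_client, runtime_client, endpoint_name,
                         input_location, variant_name="AllTraffic"):
    """Request one instance if the variant is at zero, then queue the request.

    sm_client and runtime_client are expected to be boto3 clients for
    "sagemaker" and "sagemaker-runtime" respectively.
    """
    desc = sm_client.describe_endpoint(EndpointName=endpoint_name)

    # Find the desired instance count of the variant we care about.
    desired = None
    for variant in desc.get("ProductionVariants", []):
        if variant.get("VariantName") == variant_name:
            desired = variant.get("DesiredInstanceCount", 0)

    # Only ask for capacity when the endpoint is idle at zero. An endpoint that
    # is already updating or already has capacity is left alone.
    if desc.get("EndpointStatus") == "InService" and desired == 0:
        # Assumption: the endpoint accepts a direct capacity change here.
        sm_client.update_endpoint_weights_and_capacities(
            EndpointName=endpoint_name,
            DesiredWeightsAndCapacities=[
                {"VariantName": variant_name, "DesiredInstanceCount": 1}
            ],
        )

    # Queue the request. The async endpoint holds it until an instance is up.
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_location,  # S3 URI of the request payload
    )
    return response["OutputLocation"]
```

You would call it with `boto3.client("sagemaker")` and `boto3.client("sagemaker-runtime")` plus the S3 URI of your payload. Scaling back in to 0 is left to your existing scale-in policy, or to a second `update_endpoint_weights_and_capacities` call once the backlog is empty.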