SageMaker Async Endpoint autoscaling - how to make it work well?


Hi there, I'm having trouble with autoscaling on a SageMaker Async Endpoint. In particular, I have 3 CloudWatch alarms that trigger the scaling policies:

  • ApproximateBacklogSizePerInstance < 4.5 for 15 datapoints within 15 minutes
  • ApproximateBacklogSizePerInstance > 5 for 3 datapoints within 3 minutes
  • HasBacklogWithoutCapacity >= 1 for 1 datapoint within 1 minute
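
For debugging, the state transitions of these alarms can be pulled from CloudWatch with describe_alarm_history; a minimal sketch (alarm_name is a placeholder for one of the alarm names above):

import boto3

# Sketch: print recent state transitions for one alarm to see when and
# why it fired; alarm_name is a placeholder for an actual alarm name.
cw = boto3.client('cloudwatch')
history = cw.describe_alarm_history(
    AlarmName=alarm_name,
    HistoryItemType='StateUpdate',
    MaxRecords=50
)
for item in history['AlarmHistoryItems']:
    print(item['Timestamp'], item['HistorySummary'])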

When scaling out happens, my endpoint remains stuck in Updating status until all the enqueued messages have been processed. This leads to errors like:

  • The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch log for this endpoint. -> Because of this error, several instances are created and destroyed (I checked the action history in CloudTrail). As a result, some messages get "stranded" mid-processing and are never completed; their processing only resumes when another scaling IN action is performed. (For example, if I have 50 messages to process, my endpoint reaches the configured maximum number of instances and produces the above error. It then stops processing some messages, and these are only processed again once a scale-in happens.)

  • Received error: "Resource endpoint/endpoint_name/variant/AllTraffic was not in a scalable state due to reason: Endpoint X is in Updating state, should be in InService state to scale."
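
A partial mitigation I'm considering for this second error is to wait until the endpoint is back InService before changing any scaling configuration; a minimal sketch using the standard boto3 waiter:

import boto3

# Sketch: block until the endpoint is InService again, to avoid
# "was not in a scalable state" errors when updating scaling settings.
sm = boto3.client('sagemaker')
waiter = sm.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)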

Moreover, when the scale-in to 0 instances happens while some messages (fewer than our threshold) are still being processed, some of them remain "stranded" (similar to the first point) and are no longer processed, with no errors produced. Only when another scaling activity is performed do these messages become "visible" again and get processed; it is as if there were a visibility timeout on the queue inside the SageMaker endpoint.
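
For what it's worth, stranded invocations can at least be detected by checking whether the OutputLocation returned by each invoke_endpoint_async call ever appears in S3; a rough sketch (submitted_outputs is a placeholder for a list of (bucket, key) pairs tracked at submit time):

import boto3

# Sketch: a result object that never shows up under its OutputLocation
# means the invocation is "stranded"; submitted_outputs is a placeholder.
s3 = boto3.client('s3')
stranded = []
for bucket, key in submitted_outputs:
    try:
        s3.head_object(Bucket=bucket, Key=key)
    except s3.exceptions.ClientError:
        stranded.append(key)
print(f"{len(stranded)} invocations never produced output")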

How can I solve these problems? It seems like a bug.

The Python code I'm using to create the endpoint is the following:

import boto3
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.async_inference import AsyncInferenceConfig

model = TensorFlowModel(
    source_dir=f'../models/{model_name}/code',
    entry_point='inference.py',
    sagemaker_session=session,
    model_data=model_data_tar_gz,
    role=tf_model_role,
    image_uri=image_uri,
    name=model_sg_name,
    code_location=final_model_output,
    env={
        'OMP_NUM_THREADS': '1',
        # Setting this to the number of available physical cores is recommended (g5.4x -> 16).
        'SAGEMAKER_TFS_INTRA_OP_PARALLELISM': '1',
    },
    vpc_config={
        "Subnets": subnets,
        "SecurityGroupIds": security_groups
    }
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    async_inference_config=AsyncInferenceConfig(
        output_path=out_inference,
        max_concurrent_invocations_per_instance=max_invocation_instance,
        notification_config={
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic
        }
    )
)
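
For completeness, the endpoint is then invoked roughly like this (the input S3 URI is a placeholder):

# Sketch: submit one async invocation; the response carries the
# OutputLocation where the result will eventually be written.
smr = boto3.client('sagemaker-runtime', **auth_kwargs)
resp = smr.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation='s3://my-bucket/input/payload.json',  # placeholder URI
    ContentType='application/json'
)
print(resp['OutputLocation'])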


client = boto3.client('application-autoscaling', **auth_kwargs)
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=scaling_min,
    MaxCapacity=scaling_max
)
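
In my case scaling_min is 0, which is what enables the scale-in to zero described above. The registration can be double-checked like this:

# Sketch: confirm the variant is registered as a scalable target
# (reuses the application-autoscaling client created above).
targets = client.describe_scalable_targets(
    ServiceNamespace='sagemaker',
    ResourceIds=[resource_id],
    ScalableDimension='sagemaker:variant:DesiredInstanceCount'
)
print(targets['ScalableTargets'])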


response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': queue_target_value,
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSize',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 120
    }
)
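
The backlog alarms that drive this target-tracking policy are created automatically by Application Auto Scaling; they can be listed like this (their names start with "TargetTracking-"):

# Sketch: list the alarms Application Auto Scaling auto-created for the
# target-tracking policy above.
cw = boto3.client('cloudwatch', **auth_kwargs)
alarms = cw.describe_alarms(AlarmNamePrefix='TargetTracking-')
for alarm in alarms['MetricAlarms']:
    print(alarm['AlarmName'], alarm['StateValue'])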

response_scaling = client.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments":[ 
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
        ]
    } 
)
cw_client = boto3.client('cloudwatch', **auth_kwargs)
response = cw_client.put_metric_alarm(
    AlarmName=f"{endpoint_name}/Backlog-without-capacity",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    Period=60,
    AlarmActions=[response_scaling['PolicyARN']]
)
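
To verify that this step-scaling path actually fires, the alarm can be pushed into ALARM state by hand (it reverts on the next metric evaluation):

# Sketch: force the alarm into ALARM to test the step-scaling policy.
cw_client.set_alarm_state(
    AlarmName=f"{endpoint_name}/Backlog-without-capacity",
    StateValue='ALARM',
    StateReason='Manual test of the scaling policy'
)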