SageMaker async endpoint autoscaling - how to make it work well?


Hi there, I'm having trouble with autoscaling on a SageMaker async endpoint. In particular, I have three CloudWatch alarms that trigger the scaling policies:

  • ApproximateBacklogSizePerInstance < 4.5 for 15 datapoints within 15 minutes
  • ApproximateBacklogSizePerInstance > 5 for 3 datapoints within 3 minutes
  • HasBacklogWithoutCapacity >= 1 for 1 datapoint within 1 minute
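
For reference, this is roughly how I verify what CloudWatch actually has configured on the per-instance backlog metric (a minimal sketch; endpoint_name is the same endpoint name used in the deployment code below):

import boto3

cw = boto3.client('cloudwatch')
# List every alarm attached to the per-instance backlog metric of this endpoint
alarms = cw.describe_alarms_for_metric(
    MetricName='ApproximateBacklogSizePerInstance',
    Namespace='AWS/SageMaker',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
)
for alarm in alarms['MetricAlarms']:
    print(alarm['AlarmName'], alarm['ComparisonOperator'], alarm['Threshold'])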

When scaling out happens, my endpoint remains stuck in Updating status until all the enqueued messages have been processed. This leads to errors such as:

  • "The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch log for this endpoint." Because of this error, several instances are created and destroyed (I checked the action history in CloudTrail). As a result, some messages that were in flight are left "stranded" and their processing is never completed; it only resumes when another scale-in action is performed. (For example, if I have 50 messages to process, my endpoint reaches the configured maximum number of instances and produces the above error; some messages are then left unprocessed, and their processing only resumes when a scale-in happens.)

  • Received error: "Resource endpoint/endpoint_name/variant/AllTraffic was not in a scalable state due to reason: Endpoint X is in Updating state, should be in InService state to scale." (see the status check sketch right after this list)
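
This error only appears while the endpoint is in Updating state, which I can confirm with a quick status check (a sketch, using the same auth_kwargs as in the code below):

import boto3

sm = boto3.client('sagemaker', **auth_kwargs)
# While autoscaling is blocked, this keeps returning 'Updating'
status = sm.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
print(status)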

Moreover, when a scale-in to 0 instances happens while some messages are still being processed (fewer than our threshold), some of them remain "stranded" (similar to the first point) and are no longer processed, with no errors produced. Only when another scaling activity is performed do these messages become "visible" again and get processed (it is as if there were a visibility timeout on the queue inside the SageMaker endpoint).
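
From outside, the stranded messages show up as a backlog that stays flat and non-zero while nothing is running; a sketch of how I watch this:

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client('cloudwatch')
# A backlog that stays flat and non-zero while no instance is processing
# anything is how the "stranded" messages show up from outside.
stats = cw.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='ApproximateBacklogSize',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=['Average'],
)
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])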

How can I solve these problems? It seems like a bug.

The Python code I'm using to create the endpoint is the following:

import boto3
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.async_inference import AsyncInferenceConfig

model = TensorFlowModel(
    source_dir=f'../models/{model_name}/code',
    entry_point='inference.py',
    sagemaker_session=session,
    model_data=model_data_tar_gz, 
    role=tf_model_role, 
    image_uri=image_uri,
    name=model_sg_name,
    code_location=final_model_output,
    env={
        'OMP_NUM_THREADS': '1',
        # Setting this to the number of available physical cores is
        # recommended (g5.4xlarge -> 16)
        'SAGEMAKER_TFS_INTRA_OP_PARALLELISM': '1',
    },
    vpc_config={
        "Subnets": subnets,
        "SecurityGroupIds": security_groups
    }
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    async_inference_config=AsyncInferenceConfig(
        output_path=out_inference,
        max_concurrent_invocations_per_instance=max_invocation_instance,
        notification_config={
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic
        }
    )
)
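
For completeness, the messages I mention above are enqueued with InvokeEndpointAsync, roughly like this (the S3 input URI is just a placeholder):

runtime = boto3.client('sagemaker-runtime', **auth_kwargs)
response = runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation='s3://my-bucket/input/message-001.json',  # placeholder URI
    ContentType='application/json',
)
print(response['OutputLocation'])  # where the result will land once processed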


client = boto3.client('application-autoscaling', **auth_kwargs)
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=scaling_min,
    MaxCapacity=scaling_max
)
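
Here resource_id is the standard Application Auto Scaling identifier for the variant, the same string that appears in the error message above:

# Standard Application Auto Scaling identifier for the endpoint variant
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'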


response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': queue_target_value,
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSize',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 120
    }
)
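
If queue_target_value is 5, the first two alarms listed at the top match the pair that Application Auto Scaling creates automatically for a target-tracking policy (high alarm at the target for 3 datapoints, low alarm at 90% of the target for 15 datapoints). Those auto-created alarms can be listed like this (a sketch):

import boto3

cw = boto3.client('cloudwatch')
# Alarms created by Application Auto Scaling are prefixed "TargetTracking-"
alarms = cw.describe_alarms(AlarmNamePrefix=f'TargetTracking-{resource_id}')
for alarm in alarms['MetricAlarms']:
    print(alarm['AlarmName'], alarm['MetricName'],
          alarm['ComparisonOperator'], alarm['Threshold'])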

response_scaling = client.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments":[ 
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
        ]
    } 
)
cw_client = boto3.client('cloudwatch', **auth_kwargs)
response = cw_client.put_metric_alarm(
    AlarmName=f"{endpoint_name}/Backlog-without-capacity",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    Period=60,
    AlarmActions=[response_scaling['PolicyARN']]
)
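
The scaling activity history is where I see the "was not in a scalable state" messages quoted above; I pull it like this (a sketch, reusing the application-autoscaling client from above):

# The "not in a scalable state" errors appear in the scaling activity history
activities = client.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
)
for activity in activities['ScalingActivities']:
    print(activity['StartTime'], activity['StatusCode'],
          activity.get('StatusMessage', ''))
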
Asked 1 year ago · Viewed 159 times
No answers
