SageMaker async endpoint autoscaling - how to make it work well?


Hi there, I'm having trouble with autoscaling on a SageMaker async endpoint. In particular, I have three CloudWatch alarms that trigger the scaling policies:

  • ApproximateBacklogSizePerInstance < 4.5 for 15 datapoints within 15 minutes
  • ApproximateBacklogSizePerInstance > 5 for 3 datapoints within 3 minutes
  • HasBacklogWithoutCapacity >= 1 for 1 datapoint within 1 minute
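
(For reference, the first two alarms are the ones Application Auto Scaling creates automatically for the target tracking policy below; a minimal sketch of how they can be listed with boto3, assuming default credentials and the standard auto-generated alarm name prefix:)

import boto3

cloudwatch = boto3.client('cloudwatch')

# Auto-created target tracking alarms are named like
# 'TargetTracking-endpoint/<endpoint-name>/...-AlarmHigh-<uuid>'.
paginator = cloudwatch.get_paginator('describe_alarms')
for page in paginator.paginate(AlarmNamePrefix=f'TargetTracking-endpoint/{endpoint_name}'):
    for alarm in page['MetricAlarms']:
        print(alarm['AlarmName'], alarm['StateValue'], alarm['Threshold'])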

When scale-out happens, my endpoint remains stuck in Updating status until all of the enqueued messages have been processed. This leads to errors such as:

  • "The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch log for this endpoint." -> Because of this error, several instances are created and destroyed (I checked the action history in CloudTrail). As a result, some messages are left "stranded" mid-processing and are never completed; their processing only resumes when another scale-in action is performed. (For example, if I have 50 messages to process, my endpoint reaches the configured maximum number of instances and produces the above error. Then it stops processing some messages, and these are only picked up again when a scale-in occurs.)

  • Received error: "Resource endpoint/endpoint_name/variant/AllTraffic was not in a scalable state due to reason: Endpoint X is in Updating state, should be in InService state to scale." (see the sketch after this list)
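
The second error happens whenever a scaling call arrives while the endpoint is still Updating. A minimal sketch of how the state can be checked first with boto3's built-in waiter (this only works around that symptom, not the failed health checks):

import boto3

sm_client = boto3.client('sagemaker')

def wait_until_in_service(endpoint_name):
    # Scaling calls fail with "was not in a scalable state" while the
    # endpoint is Updating, so block until it is back to InService.
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name)
    status = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
    print(f'{endpoint_name} is now {status}')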

Moreover, when a scale-in to 0 instances happens while some messages (fewer than our threshold) are still being processed, some of them remain "stranded" (as in the first point) and are never processed, with no errors produced. Only when another scaling activity is performed do these messages become "visible" again and get processed (as if there were a visibility timeout on the queue inside the SageMaker endpoint).
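
One way to detect which requests got stranded is to check whether the output object ever appears; a minimal sketch (output_location is the OutputLocation returned by each async invocation, and the parsing assumes plain s3:// URIs):

import boto3

s3 = boto3.client('s3')

def output_exists(output_location):
    # output_location is the OutputLocation returned by
    # invoke_endpoint_async, e.g. 's3://bucket/prefix/<inference-id>.out'.
    bucket, key = output_location.replace('s3://', '').split('/', 1)
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except s3.exceptions.ClientError:
        return False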

How can I solve these problems? It looks like a bug to me.

The Python code I'm using to create the endpoint is the following:

import boto3
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.async_inference import AsyncInferenceConfig

model = TensorFlowModel(
    source_dir=f'../models/{model_name}/code',
    entry_point='inference.py',
    sagemaker_session=session,
    model_data=model_data_tar_gz,
    role=tf_model_role,
    image_uri=image_uri,
    name=model_sg_name,
    code_location=final_model_output,
    env={
        'OMP_NUM_THREADS': '1',
        # Setting this to the number of available physical cores is recommended (g5.4x -> 16).
        'SAGEMAKER_TFS_INTRA_OP_PARALLELISM': '1',
    },
    vpc_config={
        "Subnets": subnets,
        "SecurityGroupIds": security_groups
    }
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    async_inference_config=AsyncInferenceConfig(
        output_path=out_inference,
        max_concurrent_invocations_per_instance=max_invocation_instance,
        notification_config={
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic
        }
    )
)
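
For completeness, invocations are then enqueued roughly like this (input_s3_uri is a placeholder for an input already uploaded to S3; ContentType depends on the payload):

runtime = boto3.client('sagemaker-runtime', **auth_kwargs)

# Each call enqueues one message on the endpoint's internal queue;
# the result lands in out_inference and a notification goes to the
# SNS topics configured above.
response = runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_uri,
    ContentType='application/json',
)
print(response['InferenceId'], response['OutputLocation'])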


client = boto3.client('application-autoscaling', **auth_kwargs)

# resource_id has the form 'endpoint/<endpoint-name>/variant/AllTraffic'
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=scaling_min,
    MaxCapacity=scaling_max
)
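
(A quick sanity check, as a sketch, to confirm the registration took effect with the expected bounds:)

targets = client.describe_scalable_targets(
    ServiceNamespace='sagemaker',
    ResourceIds=[resource_id],
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
)
for target in targets['ScalableTargets']:
    print(target['ResourceId'], target['MinCapacity'], target['MaxCapacity'])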


response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': queue_target_value,
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 120
    }
)

response_scaling = client.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments":[ 
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
        ]
    } 
)
cw_client = boto3.client('cloudwatch', **auth_kwargs)
response = cw_client.put_metric_alarm(
    AlarmName=f"{endpoint_name}/Backlog-without-capacity",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    Period=60,
    AlarmActions=[response_scaling['PolicyARN']]
)
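
Finally, the "not in a scalable state" failures can be seen in the scaling activity history; a sketch of how to dump it:

activities = client.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MaxResults=50,
)
for activity in activities['ScalingActivities']:
    # Failed activities carry the "was not in a scalable state" message.
    print(activity['StartTime'], activity['StatusCode'], activity.get('StatusMessage', ''))
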
Asked 1 year ago, 159 views
No answers
