SageMaker async endpoint autoscaling - how to make it work well?


Hi there, I'm having trouble with autoscaling on a SageMaker async endpoint. In particular, I have three CloudWatch alarms that trigger the scaling policies:

  • ApproximateBacklogSizePerInstance < 4.5 for 15 datapoints within 15 minutes
  • ApproximateBacklogSizePerInstance > 5 for 3 datapoints within 3 minutes
  • HasBacklogWithoutCapacity >= 1 for 1 datapoint within 1 minute
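
(For reference, the first two alarms are the ones Application Auto Scaling creates automatically for the target tracking policy below; a minimal sketch of how they can be listed with boto3, assuming default credentials and the standard auto-generated alarm name prefix:)

import boto3

cloudwatch = boto3.client('cloudwatch')

# Auto-created target tracking alarms are named like
# 'TargetTracking-endpoint/<endpoint-name>/...-AlarmHigh-<uuid>'.
paginator = cloudwatch.get_paginator('describe_alarms')
for page in paginator.paginate(AlarmNamePrefix=f'TargetTracking-endpoint/{endpoint_name}'):
    for alarm in page['MetricAlarms']:
        print(alarm['AlarmName'], alarm['StateValue'], alarm['Threshold'])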

When scale-out happens, my endpoint remains stuck in Updating status until all of the enqueued messages have been processed. This leads to errors such as:

  • "The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch log for this endpoint." -> Because of this error, several instances are created and destroyed (I checked the action history in CloudTrail). As a result, some messages are left "stranded" mid-processing and are never completed; their processing only resumes when another scale-in action is performed. (For example, if I have 50 messages to process, my endpoint reaches the configured maximum number of instances and produces the above error. Then it stops processing some messages, and these are only picked up again when a scale-in occurs.)

  • Received error: "Resource endpoint/endpoint_name/variant/AllTraffic was not in a scalable state due to reason: Endpoint X is in Updating state, should be in InService state to scale." (see the sketch after this list)
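
The second error happens whenever a scaling call arrives while the endpoint is still Updating. A minimal sketch of how the state can be checked first with boto3's built-in waiter (this only works around that symptom, not the failed health checks):

import boto3

sm_client = boto3.client('sagemaker')

def wait_until_in_service(endpoint_name):
    # Scaling calls fail with "was not in a scalable state" while the
    # endpoint is Updating, so block until it is back to InService.
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name)
    status = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
    print(f'{endpoint_name} is now {status}')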

Moreover, when a scale-in to 0 instances happens while some messages (fewer than our threshold) are still being processed, some of them remain "stranded" (as in the first point) and are never processed, with no errors produced. Only when another scaling activity is performed do these messages become "visible" again and get processed (as if there were a visibility timeout on the queue inside the SageMaker endpoint).
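
One way to detect which requests got stranded is to check whether the output object ever appears; a minimal sketch (output_location is the OutputLocation returned by each async invocation, and the parsing assumes plain s3:// URIs):

import boto3

s3 = boto3.client('s3')

def output_exists(output_location):
    # output_location is the OutputLocation returned by
    # invoke_endpoint_async, e.g. 's3://bucket/prefix/<inference-id>.out'.
    bucket, key = output_location.replace('s3://', '').split('/', 1)
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except s3.exceptions.ClientError:
        return False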

How can I solve these problems? It looks like a bug to me.

The Python code I'm using to create the endpoint is the following:

import boto3
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.async_inference import AsyncInferenceConfig

model = TensorFlowModel(
    source_dir=f'../models/{model_name}/code',
    entry_point='inference.py',
    sagemaker_session=session,
    model_data=model_data_tar_gz,
    role=tf_model_role,
    image_uri=image_uri,
    name=model_sg_name,
    code_location=final_model_output,
    env={
        'OMP_NUM_THREADS': '1',
        # Setting this to the number of available physical cores is recommended (g5.4x -> 16).
        'SAGEMAKER_TFS_INTRA_OP_PARALLELISM': '1',
    },
    vpc_config={
        "Subnets": subnets,
        "SecurityGroupIds": security_groups
    }
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    async_inference_config=AsyncInferenceConfig(
        output_path=out_inference,
        max_concurrent_invocations_per_instance=max_invocation_instance,
        notification_config={
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic
        }
    )
)
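
For completeness, invocations are then enqueued roughly like this (input_s3_uri is a placeholder for an input already uploaded to S3; ContentType depends on the payload):

runtime = boto3.client('sagemaker-runtime', **auth_kwargs)

# Each call enqueues one message on the endpoint's internal queue;
# the result lands in out_inference and a notification goes to the
# SNS topics configured above.
response = runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_uri,
    ContentType='application/json',
)
print(response['InferenceId'], response['OutputLocation'])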


client = boto3.client('application-autoscaling', **auth_kwargs)

# resource_id has the form 'endpoint/<endpoint-name>/variant/AllTraffic'
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=scaling_min,
    MaxCapacity=scaling_max
)
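
(A quick sanity check, as a sketch, to confirm the registration took effect with the expected bounds:)

targets = client.describe_scalable_targets(
    ServiceNamespace='sagemaker',
    ResourceIds=[resource_id],
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
)
for target in targets['ScalableTargets']:
    print(target['ResourceId'], target['MinCapacity'], target['MaxCapacity'])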


response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': queue_target_value,
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 120
    }
)

response_scaling = client.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments":[ 
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
        ]
    } 
)
cw_client = boto3.client('cloudwatch', **auth_kwargs)
response = cw_client.put_metric_alarm(
    AlarmName=f"{endpoint_name}/Backlog-without-capacity",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    Period=60,
    AlarmActions=[response_scaling['PolicyARN']]
)
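
Finally, the "not in a scalable state" failures can be seen in the scaling activity history; a sketch of how to dump it:

activities = client.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MaxResults=50,
)
for activity in activities['ScalingActivities']:
    # Failed activities carry the "was not in a scalable state" message.
    print(activity['StartTime'], activity['StatusCode'], activity.get('StatusMessage', ''))
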
Asked 1 year ago, 159 views
No answers
