SageMaker async endpoint autoscaling - how to make it work well?


Hi there, I'm having trouble with autoscaling on a SageMaker async endpoint. In particular, I have three CloudWatch alarms that trigger the scaling policies:

  • ApproximateBacklogSizePerInstance < 4.5 for 15 datapoints within 15 minutes
  • ApproximateBacklogSizePerInstance > 5 for 3 datapoints within 3 minutes
  • HasBacklogWithoutCapacity >= 1 for 1 datapoint within 1 minute

When scaling out happens, my endpoint remains stuck in Updating status until all of the enqueued messages have been processed. This leads to errors such as:

  • The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch log for this endpoint. -> Because of this error, several instances are created and destroyed (I checked the action history in CloudTrail). As a result, some messages end up "stranded" mid-processing and are never completed; their processing only resumes when another scale-in action is performed. (For example, if I have 50 messages to process, my endpoint reaches the configured maximum number of instances and produces the error above. Then some messages are not processed, and they are only picked up again when a scale-in happens.)

  • Received error: "Resource endpoint/endpoint_name/variant/AllTraffic was not in a scalable state due to reason: Endpoint X is in Updating state, should be in InService state to scale."

Moreover, when a scale-in to 0 instances is performed while some messages are still being processed (fewer than our threshold), some of them remain "stranded" (similar to the first point) and are no longer processed (no errors are produced). Only when another scaling activity is performed do these messages become "visible" again and get processed (as if there were a visibility timeout on the internal queue of the SageMaker endpoint).
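
As a workaround, I've been experimenting with re-driving requests whose output never appeared in S3. Below is a minimal sketch of the idea, assuming I record each request's input URI and the OutputLocation returned by invoke_endpoint_async at submission time (the submitted dict and the content type are placeholders):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
runtime = boto3.client('sagemaker-runtime')

def redrive_stranded(submitted, endpoint_name):
    # submitted: hypothetical dict mapping each input S3 URI to the
    # OutputLocation returned by invoke_endpoint_async for that request
    for input_uri, output_uri in submitted.items():
        bucket, key = output_uri.replace('s3://', '').split('/', 1)
        try:
            s3.head_object(Bucket=bucket, Key=key)  # output exists -> completed
        except ClientError:
            # No output object: the request is likely stranded, enqueue it again
            runtime.invoke_endpoint_async(
                EndpointName=endpoint_name,
                InputLocation=input_uri,
                ContentType='application/json',
            )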

How can I solve these problems? It looks like a bug to me.

The Python code I'm using to create the endpoint is the following:

import boto3
from sagemaker.tensorflow import TensorFlowModel
from sagemaker.async_inference import AsyncInferenceConfig

model = TensorFlowModel(
    source_dir=f'../models/{model_name}/code',
    entry_point='inference.py',
    sagemaker_session=session,
    model_data=model_data_tar_gz,
    role=tf_model_role,
    image_uri=image_uri,
    name=model_sg_name,
    code_location=final_model_output,
    env={
        'OMP_NUM_THREADS': '1',
        # Setting this to the number of available physical cores is recommended (g5.4x -> 16)
        'SAGEMAKER_TFS_INTRA_OP_PARALLELISM': '1',
    },
    vpc_config={
        "Subnets": subnets,
        "SecurityGroupIds": security_groups
    }
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    async_inference_config=AsyncInferenceConfig(
        output_path=out_inference,
        max_concurrent_invocations_per_instance=max_invocation_instance,
        notification_config={
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic
        }
    )
)
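
For reference, this is roughly how requests are enqueued once the endpoint is up (a minimal sketch; input_s3_uri is a placeholder for a payload already uploaded to S3):

runtime = boto3.client('sagemaker-runtime', **auth_kwargs)
response = runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_uri,  # s3:// URI of the request payload
    ContentType='application/json',
)
# response['OutputLocation'] is where the result lands under out_inference;
# response['InferenceId'] identifies the request in the SNS notifications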


client = boto3.client('application-autoscaling', **auth_kwargs)
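# resource_id follows the standard Application Auto Scaling format for an
# endpoint variant (assuming the default AllTraffic variant, as in the errors above)
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'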
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=scaling_min,
    MaxCapacity=scaling_max
)


response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': queue_target_value,
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSize',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 120
    }
)

response_scaling = client.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments":[ 
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
        ]
    } 
)
cw_client = boto3.client('cloudwatch', **auth_kwargs)
response = cw_client.put_metric_alarm(
    AlarmName=f"{endpoint_name}/Backlog-without-capacity",
    MetricName='HasBacklogWithoutCapacity',
    Namespace='AWS/SageMaker',
    Statistic='Average',
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing',
    Dimensions=[{'Name': 'EndpointName', 'Value': endpoint_name}],
    Period=60,
    AlarmActions=[response_scaling['PolicyARN']]
)
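
To watch what happens during these events, I poll the endpoint status and the scaling activity history (a diagnostic sketch, reusing the client and resource_id from above; failed activities carry the "not in a scalable state" message):

sm_client = boto3.client('sagemaker', **auth_kwargs)
status = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
print(f'Endpoint status: {status}')  # 'Updating' while scaling, 'InService' otherwise

activities = client.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount'
)
for activity in activities['ScalingActivities']:
    print(activity['StatusCode'], activity.get('StatusMessage', ''))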
Asked a year ago · 159 views
No answers
