SageMaker autoscaling doesn't work as expected


I've configured a model for real-time inference and it's working correctly. I'm now trying to apply an autoscaling policy using the boto3 library to put the scaling policy. The code I've used:

import boto3

endpoint_name = "a-deployed-endpoint"
asg_client = boto3.client('application-autoscaling')
resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic'
asg_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=5
)

response = asg_client.put_scaling_policy(
    PolicyName='SageMakerEndpointInvocationScalingPolicy',
    ServiceNamespace='sagemaker',  
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 20.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 60, 
        'ScaleOutCooldown': 10 
    }
)
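For intuition: with the SageMakerVariantInvocationsPerInstance predefined metric, target tracking tries to keep invocations per instance per minute near TargetValue, so the capacity it converges toward can be estimated with simple arithmetic. This is a rough sketch only; the real algorithm also honors alarm evaluation periods and the cooldowns above:

```python
import math

def estimated_capacity(invocations_per_minute, target_value=20.0,
                       min_capacity=1, max_capacity=5):
    """Rough instance count target tracking converges toward,
    clamped to the registered Min/MaxCapacity."""
    needed = math.ceil(invocations_per_minute / target_value)
    return max(min_capacity, min(max_capacity, needed))

print(estimated_capacity(30))   # 2 instances for 30 invocations/min
print(estimated_capacity(200))  # capped at MaxCapacity = 5
```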

To confirm the policy has been configured correctly, I've double-checked with the describe_scaling_policies method from the boto3 SDK and in the AWS console.

With the describe_scaling_policies method, we get:

res = asg_client.describe_scaling_policies(
    PolicyNames=['SageMakerEndpointInvocationScalingPolicy'],
    ServiceNamespace='sagemaker', # The namespace of the AWS service that provides the resource. 
    ResourceId=resource_id, # Endpoint name 
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # SageMaker supports only Instance Count
)
print(res)

###### printed result ######
{
    'ScalingPolicies': [
        {
            'PolicyARN': 'arn:aws:autoscaling:ap-southeast-1:***:scalingPolicy:c97d566f-50f0-4ac4-ac23-765a815095e9:resource/sagemaker/endpoint/jets-gpu-endpoint/variant/AllTraffic:policyName/SageMakerEndpointInvocationScalingPolicy',
            'PolicyName': 'SageMakerEndpointInvocationScalingPolicy',
            'ServiceNamespace': 'sagemaker',
            'ResourceId': 'endpoint/jets-gpu-endpoint/variant/AllTraffic',
            'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
            'PolicyType': 'TargetTrackingScaling',
            'TargetTrackingScalingPolicyConfiguration': {
                'TargetValue': 20.0,
                'PredefinedMetricSpecification': {
                    'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
                },
                'ScaleOutCooldown': 10,
                'ScaleInCooldown': 60
            },
            'Alarms': [
                {
                    'AlarmName': 'TargetTracking-endpoint/jets-gpu-endpoint/variant/AllTraffic-AlarmHigh-bc06a793-369a-4515-8a7f-3cb5127455b6',
                    'AlarmARN': 'arn:aws:cloudwatch:ap-southeast-1:***:alarm:TargetTracking-endpoint/jets-gpu-endpoint/variant/AllTraffic-AlarmHigh-bc06a793-369a-4515-8a7f-3cb5127455b6'
                },
                {
                    'AlarmName': 'TargetTracking-endpoint/jets-gpu-endpoint/variant/AllTraffic-AlarmLow-9889edca-45f2-4f73-9f71-76cacf7b5882',
                    'AlarmARN': 'arn:aws:cloudwatch:ap-southeast-1:***:alarm:TargetTracking-endpoint/jets-gpu-endpoint/variant/AllTraffic-AlarmLow-9889edca-45f2-4f73-9f71-76cacf7b5882'
                }
            ],
            'CreationTime': datetime.datetime(2023, 5, 26, 7, 3, 12, 714000, tzinfo=tzlocal())
        }
    ],
    'ResponseMetadata': {
        'RequestId': '56238668-ee84-4319-b2a9-8448cac98358',
        'HTTPStatusCode': 200,
        'HTTPHeaders': {
            'x-amzn-requestid': '56238668-ee84-4319-b2a9-8448cac98358',
            'content-type': 'application/x-amz-json-1.1',
            'content-length': '1341',
            'date': 'Fri, 26 May 2023 07:03:14 GMT'
        },
        'RetryAttempts': 0
    }
}
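The Alarms field in this response lists the two CloudWatch alarms that target tracking created (AlarmHigh drives scale-out, AlarmLow drives scale-in). A small helper to pull their names out of a response shaped like the one above, so they can be looked up in CloudWatch (sample data abbreviated):

```python
def policy_alarm_names(response):
    """Collect the CloudWatch alarm names attached to each scaling
    policy in a describe_scaling_policies response."""
    return [alarm["AlarmName"]
            for policy in response.get("ScalingPolicies", [])
            for alarm in policy.get("Alarms", [])]

# Abbreviated sample shaped like the response above
sample = {"ScalingPolicies": [{"Alarms": [
    {"AlarmName": "TargetTracking-...-AlarmHigh-..."},
    {"AlarmName": "TargetTracking-...-AlarmLow-..."},
]}]}
print(policy_alarm_names(sample))
```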

With the AWS console: [screenshot of the scaling policy configuration in the console]

Everything looks correct, right? So I ran some load tests to make sure that the autoscaling works (for more information, I used the locust library to carry out the test). After 5 minutes, the endpoint starts hitting a bottleneck and the high-traffic CloudWatch alarm fires, as in the image below:

[screenshot of the CloudWatch alarm in the ALARM state]

The weird thing is that even 20 minutes after CloudWatch raised the alarm, the endpoint still hasn't scaled out as expected, although I've tried a low target threshold and a short scale-out cooldown. Any clue or suggestion to make the endpoint scale?

  • 1 quick comment: you crossed out the endpoint name, which IMO isn't very sensitive, but you left your account ID in the output above it; you might want to redact that

  • Thanks Shahad_C, I fixed it!

asked a year ago · 1178 views
2 Answers

There's not quite enough info here to know for sure what happened, but here are some places you can look to find out:

  1. Since the CloudWatch alarm was in the ALARM state, it should have been notifying Auto Scaling every minute. You'll see the result of the first notification in the alarm history; make sure it shows the action was triggered.
  2. Check the Auto Scaling activity history. If nothing comes up, try adding the --include-not-scaled-activities flag, but I'm guessing that won't be needed here, since most likely:
  3. I'm guessing Auto Scaling tried to scale, but SageMaker couldn't fulfill the request for some reason (vCPU limits, maybe?). Check on the SageMaker side to see if its desired capacity was changed, and whether it reported any errors. You could also check CloudTrail for API calls from Auto Scaling to SageMaker around that time trying to scale.
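For step 2, the activity history can also be pulled with boto3, e.g. asg_client.describe_scaling_activities(ServiceNamespace='sagemaker', ResourceId=resource_id, IncludeNotScaledActivities=True). A small helper to surface the failed attempts from such a response, sketched against a hypothetical sample (the StatusMessage field carries the reason):

```python
def failed_activities(response):
    """Return scaling activities that did not succeed, with their reasons."""
    return [(a["ActivityId"], a.get("StatusMessage", ""))
            for a in response.get("ScalingActivities", [])
            if a["StatusCode"] in ("Failed", "Unfulfilled")]

# Hypothetical sample of a describe_scaling_activities response
sample = {"ScalingActivities": [
    {"ActivityId": "act-1", "StatusCode": "Successful"},
    {"ActivityId": "act-2", "StatusCode": "Failed",
     "StatusMessage": "Instance launch failed"},
]}
print(failed_activities(sample))
```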
AWS
EXPERT
answered a year ago
  • Thanks @Shahad_C for your helpful comment, your guess is right.

    1. Based on your suggestion, I checked the CloudWatch logs and saw that Auto Scaling tried to scale but failed.
    2. The error was "pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found"
    3. I used python38, torch==1.12.1, onnxruntime-gpu==1.14.1 and an ml.g4dn.xlarge GPU instance

    For further information: with the initially provisioned instances I can invoke the deployed model successfully (using the GPU, of course), but when it tries to scale out I get the error above. Do you have a guess what the problem might be?

  • Glad you found the error :D I'm not as familiar with the workings of SageMaker itself. I do see another user who had the same issue (although outside of SageMaker), and it looks like it was a dependency issue. So my best guess would be a version mismatch somewhere

  • I discussed this with a coworker who works with SageMaker more. Are you using an AWS-provided image for the model? If so, can you provide the URI of the container used for deploying the endpoint?

  • Hi @Shahad_C

    1. I've used a provided image from AWS; the deployment code is as follows:

    env = {'SAGEMAKER_REQUIREMENTS': 'requirements.txt'}
    model = PyTorchModel(
        entry_point="inference.py",
        source_dir="code",
        role=role,
        env=env,
        model_data=jets_model_data,
        framework_version="1.12.1",
        py_version="py38",
    )
    sagemaker_client = boto3.client('sagemaker')

    And for more information, I tried to run an exported ONNX model using the onnxruntime-gpu library.

    2. What does "URI of the container" mean? Could you describe it in more detail?
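Regarding the container URI: it's the ECR image the endpoint runs, and it can be read back from the model definition, e.g. via sagemaker_client.describe_model(ModelName=...). A tiny helper on that response shape (the sample URI below is illustrative, not the asker's actual image):

```python
def container_image_uri(describe_model_response):
    """Extract the ECR container image URI from a DescribeModel response."""
    return describe_model_response["PrimaryContainer"]["Image"]

# Illustrative sample shaped like a DescribeModel response
sample = {"PrimaryContainer": {
    "Image": "<account>.dkr.ecr.ap-southeast-1.amazonaws.com/pytorch-inference:1.12.1-gpu-py38",
}}
print(container_image_uri(sample))
```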
