My SageMaker Asynchronous Endpoint goes down to zero instances after 10 minutes.

Hi!

I'm creating an Asynchronous Endpoint for a model that takes up to 40 minutes to process its input. I want to configure it so it can scale in to zero instances and, once it receives an input, scale out from zero. I've been testing and trying different policies, but I can't solve my problem. Every time I launch the endpoint and ask it to process an input, it starts as it's supposed to, but after 10 minutes the instance is shut down and another one is started that just sits waiting for an input.

I've used policies based on CPU usage, the number of invocations, etc., but I always get the same problem. When I was testing CPU utilization I saw some weird behavior: the instance stopped after 5-10 minutes of processing the input, and then, an hour later, another instance appeared with all the work done, as if I had made another invocation to the endpoint (but I hadn't).

I want the endpoint to scale in, i.e., shut down instances when nothing has been processed for X minutes (that's why I tried to track the %CPU being used), and to scale out from zero, i.e., start a new instance when the endpoint is invoked and no instance is already up.

Here is some code for the policies that I've been using.

Scale-In policies:

import boto3

# Application Auto Scaling client; resource_id has the form
# f"endpoint/{endpoint_name}/variant/AllTraffic"
asg_client = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target that may scale down to 0 instances
response = asg_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

# Step-scaling policy that removes one instance when its CloudWatch alarm fires
response_scale_in = asg_client.put_scaling_policy(
    PolicyName=f"scaleinpolicy-{endpoint_name}",
    ServiceNamespace="sagemaker",                                # Namespace of the service that provides the resource
    ResourceId=resource_id,                                      # endpoint/<endpoint-name>/variant/<variant-name>
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only instance count
    PolicyType="StepScaling",                                    # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",  # ScalingAdjustment is an absolute change in capacity, not a percentage
        "MetricAggregationType": "Average",    # Aggregation type for the CloudWatch metric
        "Cooldown": 2400,                      # Seconds to wait for a previous scaling activity to take effect
        "StepAdjustments": [                   # Adjustments based on the size of the alarm breach
            {
                "MetricIntervalUpperBound": 0,
                "ScalingAdjustment": -1,       # Remove one instance
            }
        ],
    },
)

CPU Utilization:

# Target-tracking policy on the average CPU utilization of the endpoint variant
response = asg_client.put_scaling_policy(
    PolicyName=policy_name,
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # Try to keep the metric around 50% CPU
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
            # "Statistic": "Maximum",
            "Unit": "Percent",
        },
        "ScaleOutCooldown": scale_out_cool_down,
        "ScaleInCooldown": scale_in_cool_down,
        "DisableScaleIn": False,  # The policy is allowed to remove instances on its own
    },
)

Scale out (from zero):

# Step-scaling policy that adds one instance when there is a backlog but no capacity
response_scale_out = asg_client.put_scaling_policy(
    PolicyName=f"HasBacklogWithoutCapacity-ScalingPolicy-{endpoint_name}",
    ServiceNamespace="sagemaker",                                # Namespace of the service that provides the resource
    ResourceId=resource_id,                                      # endpoint/<endpoint-name>/variant/<variant-name>
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only instance count
    PolicyType="StepScaling",                                    # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",  # ScalingAdjustment is an absolute change in capacity, not a percentage
        "MetricAggregationType": "Average",    # Aggregation type for the CloudWatch metric
        "Cooldown": 300,                       # Seconds to wait for a previous scaling activity to take effect
        "StepAdjustments": [                   # Adjustments based on the size of the alarm breach
            {
                "MetricIntervalLowerBound": 0.0,
                "ScalingAdjustment": 1,        # Add one instance
            }
        ],
    },
)

I'm invoking the endpoint as follows:

# sm_client must be the SageMaker runtime client: boto3.client("sagemaker-runtime")
response = sm_client.invoke_endpoint_async(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",  # MIME type of the file to be processed
    InputLocation="s3://path_to_s3_file",
    Accept="application/xml",
    InvocationTimeoutSeconds=3600,
    CustomAttributes="some_atribs",
    InferenceId="gened_id",
)

I've been searching on re:Post, but no configuration is working for me, and I don't understand what is happening with the alarms or the policies.

I have tried changing more things. As the AI comment suggested, I used 'Maximum' for the CPU metric instead of 'Average' and set the target to 800% CPU utilization, and this is what I get: first an instance is created; 10 minutes later that instance is shut down and a new one is started. If I make another invocation, the new instance handles it, processes it, and then processes the earlier request whose processing was "killed" along with the first instance. If I keep making invocations, the instances keep being shut down after 10 minutes. I don't get any error file in the S3 path for error logs, and when an input is processed the way it's supposed to be, the output is correct. I don't know what's happening or whether this helps. I have uploaded an image that may make what I'm describing clearer.

[Image: %CPU used]

Thank you very much in advance!!

1 Answer

Based on your description, it seems you're encountering some challenges with configuring your Amazon SageMaker Asynchronous Endpoint to scale properly. Let's address your concerns and provide some insights on how to improve your setup.

First, it's important to note that Amazon SageMaker Asynchronous Inference is designed to queue incoming requests and process them asynchronously. This feature is ideal for requests with large payload sizes (up to 1GB) and long processing times (up to one hour), which aligns with your use case of a model that takes up to 40 minutes to process input.
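
For reference, those asynchronous settings (output and failure locations in S3, concurrency per instance) live in the AsyncInferenceConfig block of the endpoint configuration, which also relates to suggestion 4 below. A minimal sketch, where the config name, model name, instance type, S3 paths, and concurrency value are all placeholders:

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_endpoint_config(
    EndpointConfigName="my-async-endpoint-config",           # placeholder name
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",                          # placeholder model name
            "InstanceType": "ml.m5.xlarge",                   # placeholder instance type
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-output/",   # where results are written
            "S3FailurePath": "s3://my-bucket/async-errors/",  # where error files are written
        },
        "ClientConfig": {
            # One long-running 40-minute job per instance at a time
            "MaxConcurrentInvocationsPerInstance": 1,
        },
    },
)

The endpoint itself is then created from this configuration with create_endpoint.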

The behavior you're experiencing, where the instance shuts down after 10 minutes of inactivity, is actually a feature of SageMaker's new scale-to-zero capability. It automatically stops all underlying instances of the endpoint after a period of inactivity, eliminating the associated instance costs so that you only pay while your endpoint is processing requests.

To address your specific requirements:

  1. Scaling to zero: This is already happening as designed after 10 minutes of inactivity.

  2. Scaling out from zero: When a new request comes in while the endpoint is scaled to zero, SageMaker should automatically start provisioning a new instance to handle the request. However, this process can take a few minutes.

  3. Long-running jobs: For your 40-minute processing jobs, you need to ensure that your scaling policies and endpoint configuration allow for these long-running tasks without interruption.

Here are some suggestions to improve your setup:

  1. Increase the scale-in cooldown period: In your scale-in policy, you have set a cooldown of 2400 seconds (40 minutes). Consider increasing this to account for your long-running jobs plus some buffer time.

  2. Adjust your CPU utilization policy: Instead of using average CPU utilization, consider using maximum CPU utilization. This can help prevent scaling in while a job is still running.

  3. Review your CloudWatch alarms: Ensure you have appropriate alarms set up to trigger your scaling policies. For example, you might want an alarm that triggers scale-out when there are pending requests in the queue (see the sketch after this list).

  4. Check your endpoint configuration: Make sure your endpoint is configured correctly for asynchronous inference, including specifying an appropriate AsyncInferenceConfig object when creating the endpoint.

  5. Increase invocation timeout: In your invoke_endpoint_async call, you've set InvocationTimeoutSeconds=3600 (1 hour), which is good for your 40-minute jobs. Make sure this aligns with your endpoint configuration.

  6. Monitor your endpoint: Use CloudWatch to monitor your endpoint's behavior, including instance counts, CPU utilization, and queue metrics. This can help you understand when and why scaling actions are occurring.
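
On suggestion 3 specifically: a step-scaling policy such as your HasBacklogWithoutCapacity-ScalingPolicy does nothing on its own; Application Auto Scaling only executes it when a CloudWatch alarm that lists the policy ARN in its AlarmActions goes into alarm. If no such alarm exists, the endpoint will never scale out from zero. A sketch of that wiring, assuming response_scale_out is the object returned by your scale-out put_scaling_policy call (the alarm name, period, and evaluation settings are illustrative):

import boto3

cw_client = boto3.client("cloudwatch")

# PolicyARN returned by the put_scaling_policy call for the scale-out policy
scale_out_policy_arn = response_scale_out["PolicyARN"]

cw_client.put_metric_alarm(
    AlarmName=f"HasBacklogWithoutCapacity-{endpoint_name}",  # illustrative name
    MetricName="HasBacklogWithoutCapacity",                  # 1 when requests are queued but no instance is running
    Namespace="AWS/SageMaker",
    Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    DatapointsToAlarm=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="missing",
    AlarmActions=[scale_out_policy_arn],                     # fire the step-scaling policy
)

The same applies to your scale-in step policy: it also needs its own alarm attached through AlarmActions (for example, one that fires when the backlog has stayed at zero for your chosen idle window), otherwise it never runs either.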

Remember, when the endpoint scales to zero and a new request comes in, you may initially see a validation error indicating that the inference component has no capacity. This is normal, and SageMaker should start provisioning a new instance to handle the request within about a minute.
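
Related to suggestion 6, the scaling activity history records which policy or alarm caused each instance to be added or removed, which should tell you exactly what is terminating your instance after 10 minutes. A quick sketch using the Application Auto Scaling API, with resource_id as in your policy code:

import boto3

asg_client = boto3.client("application-autoscaling")

activities = asg_client.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,  # e.g. f"endpoint/{endpoint_name}/variant/AllTraffic"
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MaxResults=20,
)
for activity in activities["ScalingActivities"]:
    # "Cause" names the alarm or policy that triggered the activity
    print(activity["StartTime"], activity["Description"], "|", activity["Cause"])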

If you're still experiencing issues after making these adjustments, you may need to contact AWS support for further assistance, as there could be account-specific settings or limitations at play.

Sources
Unlock cost savings with the new scale down to zero feature in SageMaker Inference | AWS Machine Learning Blog
Asynchronous inference - Amazon SageMaker AI

answered a year ago
