Async Inference not able to process later requests

Hi there, hope all of you are fine.

I am trying to deploy a train-on-inference type model. I am done with BYOC, and it works completely fine with real-time inference endpoints. I am also able to make it work with async inference, and concurrent requests on the same instance are handled. But later requests never get processed, with no logical error in the logs. Also, once the endpoint scales down to 0 instances, it fails to scale back up.
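For context, the requests follow the standard asynchronous invocation flow, roughly like the sketch below (placeholder names, not my exact code; the payload is staged in S3 first):

import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholders - the real endpoint name and S3 input location differ.
response = runtime.invoke_endpoint_async(
    EndpointName="<my-async-endpoint>",
    InputLocation="s3://<my-bucket>/async-inputs/payload.json",
    ContentType="application/json",
)

# The request is queued; the result is written to this S3 location once
# an instance picks it up and processes it.
print(response["OutputLocation"])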

These are some of the error and warning messages I get intermittently:

data-log:
2022-03-23T11:23:17.723:[sagemaker logs] [5ea751c9-9271-4533-bc09-c117791e1372] Received server error (500) from primary with message "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

warnings:
/usr/local/lib/python3.8/dist-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])

Kindly help me with this. Thanks.

  • Can you please share your async endpoint config and your autoscaling policy? The reason it may not have scaled up could be a high ScaleOutCooldown time or the TargetTrackingScalingPolicyConfiguration itself. For instance, if you have set the target tracking policy to scale based on ApproximateBacklogSizePerInstance with a low target value and there are not enough requests in the backlog, the autoscaling policy will not be triggered (see the scale-from-zero sketch after these comments).

  • did you solve this?
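On the scale-up-from-zero point in the first comment: the AWS async inference docs recommend pairing the target tracking policy with a step scaling policy driven by a CloudWatch alarm on the HasBacklogWithoutCapacity metric, which fires when requests are queued but the endpoint has no running instances. A minimal sketch along those lines, reusing resource_id and endpoint_name as defined in the answer below:

import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Step scaling policy that adds one instance whenever the alarm below fires.
step_policy = autoscaling.put_scaling_policy(
    PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,  # "endpoint/<endpoint-name>/variant/<variant-name>"
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
    },
)

# Alarm that fires when requests sit in the backlog with zero instances running.
cloudwatch.put_metric_alarm(
    AlarmName="HasBacklogWithoutCapacity-Alarm",
    MetricName="HasBacklogWithoutCapacity",
    Namespace="AWS/SageMaker",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    DatapointsToAlarm=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="missing",
    Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
    AlarmActions=[step_policy["PolicyARN"]],
)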

asked 2 years ago • 738 views
2 Answers

Hello, I'm running into the exact same issue. I used the same guide and the async endpoint doesn't scale up or down.

answered 2 years ago

Hi, hope you are fine. Thanks for getting back to me. This is what I am using:


# Configure autoscaling on the asynchronous endpoint, down to zero instances
import boto3

client = boto3.client("application-autoscaling")

endpoint_name = "<my-async-endpoint>"  # placeholder - actual name omitted
# resource_id takes the form "endpoint/<endpoint-name>/variant/<variant-name>"
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"  # variant name assumed

response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only instance count
    PolicyType="TargetTrackingScaling",  # 'StepScaling' | 'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        # Target value for the metric below, i.e. aim for about 2 queued
        # requests per instance (ApproximateBacklogSizePerInstance).
        "TargetValue": 2.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        # Cooldowns (in seconds) keep another scale-in/scale-out activity from
        # starting before the effects of the previous one are visible; tune them
        # to your instance startup time and other application needs.
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
        # 'DisableScaleIn': True|False - if True, scale-in is disabled and the
        # target tracking policy will never remove capacity from the resource.
    },
)
answered 2 years ago
  • I think the ScaleOutCooldown period is logical here. I also tried it with 40 invocations, but even then the endpoint failed to scale up from 0.

  • Was it a sustained load test? The reason I ask is that you have set high values for ScaleInCooldown and ScaleOutCooldown, i.e. 5 minutes. If your tests are not sustained and consistently invoking the endpoint, it won't scale up. Maybe for your tests, set them to a lower value so you can watch autoscaling happen (see the load sketch after these comments). Ref notebook - https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb

  • Hi, thanks for helping me out. I don't get what you mean by sustained load. Moreover, the link you provided is the one I followed to reach my solution. The reason for the high number is that my pipeline requires heavy installation and downloads, and it takes almost 7-8 minutes to get an instance into running condition, so that's why I did so. Also, I waited for more than 20-30 minutes, but the endpoint didn't scale up from 0 to 1 instance.

    Thanks
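A sustained load here means invoking the endpoint continuously, so the backlog metric stays above the target across several consecutive one-minute datapoints; a short burst of requests may never trip the scaling alarm. A minimal load sketch (placeholder names; the input payload is assumed to be in S3 already):

import time
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholders - substitute your own endpoint name and S3 input location.
endpoint_name = "<my-async-endpoint>"
input_location = "s3://<my-bucket>/async-inputs/payload.json"

# Keep a steady stream of requests going for ~10 minutes so the backlog
# stays elevated long enough for the scaling policy to react.
for _ in range(120):
    runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_location,
        ContentType="application/json",
    )
    time.sleep(5)

# Afterwards, inspect what Application Auto Scaling decided (or why it didn't act).
autoscaling = boto3.client("application-autoscaling")
activities = autoscaling.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",  # variant name assumed
)
for activity in activities["ScalingActivities"]:
    print(activity["StatusCode"], activity["Cause"])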
