Based on your description, it seems you're encountering some challenges with configuring your Amazon SageMaker Asynchronous Endpoint to scale properly. Let's address your concerns and provide some insights on how to improve your setup.
First, it's important to note that Amazon SageMaker Asynchronous Inference is designed to queue incoming requests and process them asynchronously. This feature is ideal for requests with large payload sizes (up to 1GB) and long processing times (up to one hour), which aligns with your use case of a model that takes up to 40 minutes to process input.
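As a rough illustration of those limits, the sketch below builds the arguments for an asynchronous invocation and refuses a timeout above one hour. The helper names, the 10-minute buffer, and the endpoint and S3 names in the usage note are assumptions for illustration; only `invoke_endpoint_async` and its `EndpointName`, `InputLocation`, and `InvocationTimeoutSeconds` parameters come from the SageMaker runtime API.

```python
MAX_INVOCATION_TIMEOUT_SECONDS = 3600  # one hour, the limit quoted above


def build_async_invocation(endpoint_name, input_s3_uri,
                           expected_runtime_minutes, buffer_minutes=10):
    """Build kwargs for invoke_endpoint_async (hypothetical helper).

    The timeout is the expected runtime plus a buffer; the buffer size
    is an assumption, not a documented recommendation.
    """
    timeout = (expected_runtime_minutes + buffer_minutes) * 60
    if timeout > MAX_INVOCATION_TIMEOUT_SECONDS:
        raise ValueError(
            f"{timeout}s exceeds the {MAX_INVOCATION_TIMEOUT_SECONDS}s limit"
        )
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,  # the payload is read from S3
        "InvocationTimeoutSeconds": timeout,
    }


def submit(params):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(**params)
    return response["OutputLocation"]  # S3 URI where the result will appear
```

For a 40-minute job, `build_async_invocation("my-endpoint", "s3://my-bucket/input.json", 40)` yields a 3000-second timeout, which still fits under the one-hour ceiling.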
The behavior you're experiencing where the instance shuts down after 10 minutes of inactivity is actually a feature of SageMaker's new scale-to-zero capability. This feature automatically stops all underlying instances of the endpoint after a period of inactivity, eliminating associated instance costs. This is designed to save on costs by only paying when your endpoint is processing requests.
To address your specific requirements:
- Scaling to zero: This is already happening as designed after 10 minutes of inactivity.
- Scaling out from zero: When a new request comes in while the endpoint is scaled to zero, SageMaker should automatically start provisioning a new instance to handle the request. However, this process can take a few minutes.
- Long-running jobs: For your 40-minute processing jobs, you need to ensure that your scaling policies and endpoint configuration allow for these long-running tasks without interruption.
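One way the scale-to-zero and scale-out-from-zero pieces can be wired together with Application Auto Scaling is sketched below. It assumes a variant named `AllTraffic`, a maximum of two instances, and the `HasBacklogWithoutCapacity` metric as the scale-out trigger; the function names, policy name, alarm name, and thresholds are illustrative and not taken from your setup.

```python
def scale_from_zero_requests(endpoint_name, variant_name="AllTraffic",
                             max_instances=2):
    """Build request dicts for a scale-from-zero setup (hypothetical helper)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # MinCapacity=0 is what allows the endpoint to scale in to zero instances.
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 0,
        "MaxCapacity": max_instances,
    }

    # Step policy: add one instance whenever the alarm below fires.
    policy = {
        "PolicyName": f"{endpoint_name}-scale-from-zero",  # assumed name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Average",
            "Cooldown": 300,
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}
            ],
        },
    }

    # Alarm on queued requests that have no instance to serve them.
    alarm = {
        "AlarmName": f"{endpoint_name}-backlog-without-capacity",  # assumed
        "Namespace": "AWS/SageMaker",
        "MetricName": "HasBacklogWithoutCapacity",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 2,
        "DatapointsToAlarm": 2,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
    }
    return target, policy, alarm


def apply(target, policy, alarm):
    """Create the resources; needs boto3 and AWS credentials."""
    import boto3

    autoscaling = boto3.client("application-autoscaling")
    cloudwatch = boto3.client("cloudwatch")
    autoscaling.register_scalable_target(**target)
    policy_arn = autoscaling.put_scaling_policy(**policy)["PolicyARN"]
    # The alarm triggers the step policy through its ARN.
    cloudwatch.put_metric_alarm(**alarm, AlarmActions=[policy_arn])
```

Keeping the dictionaries separate from the `apply` step makes it possible to inspect or log the exact requests before anything is created in the account.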
Here are some suggestions to improve your setup:
- Increase the scale-in cooldown period: Your scale-in policy sets a cooldown of 2400 seconds (40 minutes). Consider increasing this to cover your longest job plus some buffer time.
- Adjust your CPU utilization policy: Instead of using average CPU utilization, consider using maximum CPU utilization. This can help prevent scaling in while a job is still running.
- Review your CloudWatch alarms: Ensure you have appropriate alarms set up to trigger your scaling policies. For example, you might want an alarm that triggers scale-out when there are pending requests in the queue.
- Check your endpoint configuration: Make sure your endpoint is configured correctly for asynchronous inference, including specifying an appropriate AsyncInferenceConfig object when creating the endpoint configuration.
- Check the invocation timeout: In your invoke_endpoint_async call, you've set InvocationTimeoutSeconds=3600 (1 hour), which is the maximum allowed and leaves room for your 40-minute jobs. Make sure this aligns with your endpoint configuration.
- Monitor your endpoint: Use CloudWatch to monitor your endpoint's behavior, including instance counts, CPU utilization, and queue metrics. This can help you understand when and why scaling actions are occurring.
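The cooldown suggestion can be made concrete with a target tracking policy whose scale-in cooldown is derived from the longest expected job. This is a minimal sketch, assuming the policy tracks the queue metric `ApproximateBacklogSizePerInstance` rather than CPU; the helper name, policy name, 10-minute buffer, and target value of 1.0 are assumptions to adjust for your workload.

```python
def backlog_tracking_policy(endpoint_name, variant_name="AllTraffic",
                            longest_job_minutes=40, buffer_minutes=10,
                            target_backlog_per_instance=1.0):
    """Build a put_scaling_policy request (hypothetical helper)."""
    # Cooldown covers the longest job plus a buffer (assumed sizing rule).
    scale_in_cooldown = (longest_job_minutes + buffer_minutes) * 60
    return {
        "PolicyName": f"{endpoint_name}-backlog-tracking",  # assumed name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_backlog_per_instance,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name}
                ],
                "Statistic": "Average",
            },
            "ScaleInCooldown": scale_in_cooldown,
            "ScaleOutCooldown": 300,
        },
    }


def apply_policy(policy):
    """Create or update the policy; needs boto3 and AWS credentials."""
    import boto3

    client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(**policy)["PolicyARN"]
```

With the defaults, a 40-minute job and a 10-minute buffer give a 3000-second scale-in cooldown, compared with the 2400 seconds currently configured.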
Remember, when the endpoint scales to zero and a new request comes in, you may initially see a validation error indicating that the inference component has no capacity. This is normal, and SageMaker should start provisioning a new instance to handle the request within about a minute.
If you're still experiencing issues after making these adjustments, you may need to contact AWS support for further assistance, as there could be account-specific settings or limitations at play.
Sources
Unlock cost savings with the new scale down to zero feature in SageMaker Inference | AWS Machine Learning Blog
Asynchronous inference - Amazon SageMaker AI
