Hi Tanzeem,
When you speak of "Amazon SageMaker's real-time inference", you probably mean SageMaker Serverless Inference, which is designed exactly for your use case: the endpoint stays available at all times, but you are charged on a pay-as-you-use basis, billed by the millisecond, only while your model is executing inferences.
See https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
See the quote below: it explains how the endpoint scales up and down for you with no effort on your side, so it will handle parallel requests for you. The mentioned cold-start time will be much lower than the times you mention above.
> Amazon SageMaker Serverless Inference is a purpose-built inference option that enables you to deploy and scale ML models without configuring or managing any of the underlying infrastructure. On-demand Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling. With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs.
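If it helps, deploying an existing model to a serverless endpoint needs only a `ServerlessInferenceConfig`; here is a minimal sketch with the SageMaker Python SDK (the model artifact, IAM role, and framework versions below are placeholders, not details from your setup):

```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.serverless import ServerlessInferenceConfig

# Placeholders: replace the artifact location, IAM role, and framework
# versions with the ones used by your existing model.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    entry_point="inference.py",
    framework_version="1.13",
    py_version="py39",
)

# Memory size (1024-6144 MB) and max concurrency are the only capacity
# knobs; SageMaker provisions and scales everything else, down to zero.
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,
        max_concurrency=10,
    )
)
```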
Best,
Didier
SageMaker currently offers four inference options:
a) Real-Time Inference b) Serverless Inference c) Batch Transform d) Asynchronous Inference
From your use case description, you are using a model that requires GPUs for inference, and SageMaker Serverless Inference (which charges you on a pay-as-you-use basis) currently does not support GPUs. Since your aim is to get inference results within 20-30 seconds of image upload, Real-Time Inference would be a suitable option. However, with Real-Time Inference you will be charged even when not inferencing, because the endpoint is kept alive; with autoscaling you can scale the instance count down to 1, but not to zero.
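For illustration, horizontal autoscaling on a real-time endpoint is configured through Application Auto Scaling; a minimal sketch with boto3 (the endpoint and variant names are placeholders):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder endpoint/variant names. MinCapacity cannot be 0 here:
# real-time endpoints always keep at least one instance running.
resource_id = "endpoint/my-realtime-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Add instances when per-instance invocations exceed the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```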
If the 20-30 second timeline is not strict, you might consider exploring the Asynchronous Inference option, which is designed for near-real-time latency requirements. Asynchronous Inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay while your endpoint is processing requests.
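An asynchronous endpoint can be deployed with the SageMaker Python SDK roughly as follows (again, the model artifact, role, instance type, and S3 output path are placeholders):

```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.async_inference import AsyncInferenceConfig

# Placeholders: artifact, role, and S3 output path.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    entry_point="inference.py",
    framework_version="1.13",
    py_version="py39",
)

# A GPU instance is allowed here, unlike Serverless Inference; requests
# are queued and results are written to the S3 output path.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-results/",
    ),
)
```

Scaling to zero is then a matter of registering the same scalable target as in the real-time sketch above, but with `MinCapacity=0`, which asynchronous endpoints, unlike real-time ones, accept.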
Regarding best practices, please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/best-practices.html
Additionally, you might reach out to an AWS Solutions Architect or the AWS Professional Services team to analyze your requirements and recommend a suitable option.
Thank you so much Lavaraja. This is quite helpful. Can you please tell me if, with real-time inferencing, I can keep the endpoint alive using a very basic low-cost instance and scale it up horizontally and vertically the moment an image or multiple images are uploaded? I could then use a scaled-up instance for each image (multiple instances for multiple images) and scale it back down again.
Thanks for the response, Didier. If it is serverless, are the necessary dependencies loaded, initialized, and ready to use for my custom model?
Hi Tanzeem, the steps for preparing a model for Serverless Inference are identical to those for preparing a model for real-time inference. See this blog for confirmation: https://aws.amazon.com/blogs/machine-learning/deploying-ml-models-using-sagemaker-serverless-inference-preview/ It states this very explicitly. So yes, all the dependencies that were packaged with your real-time model will also be present in your serverless model.
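Invocation is likewise the same regardless of endpoint type; a minimal sketch with boto3 (the endpoint name and image path are placeholders):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder endpoint name and payload; this call is identical for
# serverless and real-time endpoints.
with open("example.jpg", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",
        ContentType="application/x-image",
        Body=f.read(),
    )

result = response["Body"].read()
```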