Does the AWS SageMaker real-time inference service charge us when not inferencing?


I'm working on a pipeline that needs to run object detection on images as soon as they are uploaded. My current setup uses Terraform to launch a GPU-backed EC2 instance on each image upload, loads a custom model's Docker image, loads the necessary libraries, initializes the environment, and finally performs inference. This takes longer than desired: about 4 minutes and 50 seconds in total (EC2 startup is 2 minutes, library loading is 2 minutes, initialization is 30 seconds, and the actual inference is 20 seconds).

I've heard that Amazon SageMaker's real-time inference can return results faster because it avoids the overhead of startup, library loading, and initialization. I've also been told that SageMaker charges only for the actual inference time, rather than billing me continuously for an active endpoint. I'd like to understand how SageMaker real-time inference works and whether it can help me receive object detection results within 20-30 seconds of an image upload.

Are there any best practices or strategies I should be aware of when using SageMaker for real-time inference? I would also like to autoscale based on load: for instance, if 10 images are uploaded at once, the scaling should happen automatically. Any insights, experiences, or guidance on using SageMaker for real-time object detection would be greatly appreciated.
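To make the latency budget explicit, here is a small sketch of the arithmetic using only the stage timings quoted above. It assumes (this is an assumption, not a measured result) that an endpoint that is already warm has paid every cost except the inference step itself.

```python
# Latency breakdown reported in the question, in seconds.
stages = {
    "ec2_startup": 120,
    "library_loading": 120,
    "initialization": 30,
    "inference": 20,
}

total = sum(stages.values())
# Assumption: a warm endpoint only pays the inference cost per request.
overhead = total - stages["inference"]


def fmt(seconds):
    """Format a number of seconds as 'M min S s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"


print("total per upload:", fmt(total))        # total per upload: 4 min 50 s
print("overhead before inference:", fmt(overhead))  # 4 min 30 s
print("overhead share:", round(100 * overhead / total), "%")  # 93 %
```

Under that assumption, almost all of the 4 minutes 50 seconds is setup cost, which is why the 20-30 second target depends on keeping the model loaded somewhere.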

2 Answers

Hi Tanzeem,

When you speak of "Amazon SageMaker's real-time inference", you probably mean SageMaker Serverless Inference, which is designed for exactly your use case: the endpoint is always available, but you are charged on a pay-per-use basis, billed by the millisecond while the model is executing inference.


See the quote below: it explains how the endpoint scales up and down for you with no effort on your side, so it will take care of parallel requests. The cold start time it mentions should be much lower than the times you list above.

Amazon SageMaker Serverless Inference is a purpose-built inference option that enables you to 
deploy and scale ML models without configuring or managing any of the underlying infrastructure. 
On-demand Serverless Inference is ideal for workloads which have idle periods between traffic spurts 
and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale 
them in and out depending on traffic, eliminating the need to choose instance types or manage scaling 
policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless 
Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and 
automatic scaling. With a pay-per-use model, Serverless Inference is a cost-effective option if you 
have an infrequent or unpredictable traffic pattern. During times when there are no requests, 
Serverless Inference scales your endpoint down to 0, helping you to minimize your costs.
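
As a rough illustration of what the quote describes, the sketch below builds the endpoint configuration for a serverless variant with boto3. The helper names, the variant name "AllTraffic", and the memory and concurrency values are placeholders I picked for the example, not recommendations; the memory sizes in `ALLOWED_MEMORY_MB` are the ones I believe the service accepts, so check them against the current documentation.

```python
# Memory sizes (MB) assumed to be accepted by Serverless Inference.
ALLOWED_MEMORY_MB = {1024, 2048, 3072, 4096, 5120, 6144}


def serverless_variant(model_name, memory_mb=4096, max_concurrency=10):
    """Build one ProductionVariants entry that uses ServerlessConfig
    instead of an instance type and instance count."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"unsupported memory size: {memory_mb}")
    return {
        "VariantName": "AllTraffic",  # placeholder variant name
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


def create_serverless_endpoint_config(config_name, model_name):
    """Send the configuration to SageMaker (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    sagemaker = boto3.client("sagemaker")
    return sagemaker.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[serverless_variant(model_name)],
    )


# "my-detector" is a hypothetical model name.
print(serverless_variant("my-detector"))
```

`MaxConcurrency` is what bounds how many uploads can be served in parallel, so the "10 images at once" case would map to a value of at least 10 here.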



AWS
answered 2 months ago

SageMaker currently has four types of inference options.

a) Real-Time Inference b) Serverless Inference c) Batch Transform d) Asynchronous Inference


From your use case description, you are using a model that requires GPUs for inference, and SageMaker Serverless Inference (which charges on a pay-per-use basis) currently does not support GPUs. Since your aim is to get inference results within 20-30 seconds of image upload, Real-Time Inference would be a suitable option. However, with Real-Time Inference you are charged even when you are not inferencing, because the endpoint is kept alive. With autoscaling you can scale the instance count down to 1, but not to zero.
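
For the autoscaling part, a minimal sketch with boto3 and Application Auto Scaling might look like the following. The endpoint name, variant name, policy name, maximum instance count, and target value are all hypothetical placeholders; the target is expressed in invocations per instance per minute and would need tuning for your model.

```python
def scaling_requests(endpoint_name, variant_name="AllTraffic",
                     max_instances=4, invocations_per_instance=10.0):
    """Build the two request payloads for autoscaling a real-time endpoint."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # A real-time endpoint keeps at least one instance running.
    target = {**common, "MinCapacity": 1, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": "invocations-target-tracking",  # placeholder name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return target, policy


def apply_scaling(endpoint_name):
    """Register the target and policy (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so scaling_requests works without boto3

    autoscaling = boto3.client("application-autoscaling")
    target, policy = scaling_requests(endpoint_name)
    autoscaling.register_scalable_target(**target)
    autoscaling.put_scaling_policy(**policy)


# "my-endpoint" is a hypothetical endpoint name.
target, policy = scaling_requests("my-endpoint")
print(target["ResourceId"], target["MinCapacity"], target["MaxCapacity"])
```

Keep in mind that scaling out adds new instances, and each new instance still has to start and load the model, so a sudden burst is initially served by the instances that are already running.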

If the 20-30 second timeline is not strict, you might consider the Asynchronous Inference option, which suits workloads with near real-time latency requirements. Asynchronous Inference lets you save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.
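
A sketch of the asynchronous setup, again with boto3, could look like this. The S3 URIs, endpoint name, variant name, and capacity values are hypothetical; the point is only to show where the asynchronous settings and the zero minimum capacity go.

```python
def async_inference_config(output_s3_uri, max_concurrent_per_instance=1):
    """AsyncInferenceConfig block for create_endpoint_config."""
    return {
        "OutputConfig": {"S3OutputPath": output_s3_uri},
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance,
        },
    }


def scale_to_zero_target(endpoint_name, variant_name="AllTraffic",
                         max_instances=4):
    """Scalable target whose minimum capacity is zero instances."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,
        "MaxCapacity": max_instances,
    }


def submit_image(endpoint_name, image_s3_uri):
    """Queue one uploaded image (needs boto3 and AWS credentials).

    The call returns right away; the result is written to S3 later.
    """
    import boto3  # imported lazily so the helpers above work without boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=image_s3_uri,
        ContentType="image/jpeg",
    )
    return response["OutputLocation"]


# Hypothetical bucket and endpoint names.
print(async_inference_config("s3://my-bucket/detections/"))
print(scale_to_zero_target("my-async-endpoint")["MinCapacity"])  # 0
```

When the endpoint has scaled to zero, the first request after an idle period waits for an instance to be provisioned, which is why this option fits only if the 20-30 second target can be relaxed.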

Regarding the best practices please refer

Additionally, you might reach out to an AWS Solutions Architect or the AWS Professional Services team to analyze your requirements and recommend a suitable option.

answered 2 months ago
  • Thank you so much, Lavaraja. This is quite helpful. Can you please tell me whether, with real-time inferencing, I can keep the endpoint alive on a very basic low-cost instance and scale it up horizontally and vertically the moment one or more images are uploaded? I could then use a scaled-up instance for each image (multiple instances for multiple images) and scale back down again afterwards.
