How do I determine Sagemaker serverless inference usage?


The pricing example for On-demand Serverless Inference explains how compute charges are calculated with total inference duration being one of the factors. But how do I determine this duration on requests for the model I have deployed? Is there a metric or log entry in Cloudwatch that I can look at to see what inference durations I have for my endpoint?

2 Answers

If I understood your question correctly, you may find the answer in [1]. To help you debug your endpoints, check [2], and to monitor a serverless endpoint, check [3].

Resources:

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html
[2] https://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html
[3] https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-monitoring.html
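Concretely, the per-request inference time for an endpoint is published to CloudWatch as the ModelLatency metric (reported in microseconds) in the AWS/SageMaker namespace, with EndpointName and VariantName dimensions. A minimal sketch that builds the query parameters for that metric — the endpoint name is a placeholder, and the actual boto3 call is shown commented out since it needs AWS credentials:

```python
from datetime import datetime, timedelta, timezone

def model_latency_query(endpoint_name, variant_name="AllTraffic", hours=24):
    """Build kwargs for CloudWatch get_metric_statistics to read the
    ModelLatency metric of a SageMaker endpoint over the last `hours`.

    Note: ModelLatency is reported in microseconds."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour
        "Statistics": ["Average", "Sum", "SampleCount"],
    }

# Usage (requires boto3 and configured credentials; not executed here):
# import boto3
# cw = boto3.client("cloudwatch")
# resp = cw.get_metric_statistics(**model_latency_query("my-serverless-endpoint"))
# for point in resp["Datapoints"]:
#     print(point["Timestamp"], point["Average"], "microseconds")
```

Sum / SampleCount over a billing period gives you total and average inference duration, which feeds directly into the cost estimate discussed in the other answer.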

AWS
answered 7 months ago

Hi,

the SageMaker Serverless Inference pricing is well detailed at https://aws.amazon.com/sagemaker/pricing/

Amazon SageMaker Serverless Inference

Amazon SageMaker Serverless Inference enables you to deploy machine learning models 
for inference without configuring or managing any of the underlying infrastructure. 
You can either use on-demand Serverless Inference or add Provisioned Concurrency to your 
endpoint for predictable performance.

With on-demand Serverless Inference, you only pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. The compute charge depends on the memory configuration you choose.

This means that your cost model has 1 parameter and 4 input variables:

  1. parameter: the memory size that you select (it depends on the size of your model) -> cost/sec
  2. var1: number of inferences in a period, defined by your business case
  3. var2: average duration of an inference (coming from measurements in your initial tests)
  4. var3: average size of your prompt (if your model is an LLM)
  5. var4: average size of the prompt completion.

So, the cost will be: cost/sec [for the selected memory size] x (var1 x var2) + data processing price per GB x (var3 + var4)

The parameter gives you the price per second of inference, var1 x var2 gives you the total duration of inferences, and var3 + var4 give you the amount of data processed.
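Putting the variables above together, here is a minimal cost sketch. All numbers in the usage example are placeholders, not real AWS rates — take the actual per-second compute price for your memory size and the per-GB data processing price from the pricing page:

```python
def serverless_inference_cost(price_per_sec, n_inferences, avg_duration_sec,
                              data_price_per_gb, avg_prompt_gb, avg_completion_gb):
    """Estimate on-demand Serverless Inference cost for a period.

    price_per_sec     -> the "parameter": compute price for the chosen memory size
    n_inferences      -> var1: number of inferences in the period
    avg_duration_sec  -> var2: average inference duration
    avg_prompt_gb     -> var3: average request payload size
    avg_completion_gb -> var4: average response payload size
    """
    compute_charge = price_per_sec * n_inferences * avg_duration_sec
    data_charge = data_price_per_gb * n_inferences * (avg_prompt_gb + avg_completion_gb)
    return compute_charge + data_charge

# Usage with made-up rates (check the pricing page for real ones):
cost = serverless_inference_cost(
    price_per_sec=0.0001,     # placeholder compute rate for the memory size
    n_inferences=1000,        # var1
    avg_duration_sec=0.5,     # var2, e.g. from the ModelLatency metric
    data_price_per_gb=0.016,  # placeholder per-GB data processing rate
    avg_prompt_gb=1e-6,       # var3: ~1 KB request
    avg_completion_gb=2e-6,   # var4: ~2 KB response
)
```

Note how small the data charge is for KB-sized payloads: the compute term (duration x rate) dominates the bill for most LLM-style workloads.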

If you go with the more advanced Provisioned Concurrency, you have to replace cost/sec with (Provisioned Concurrency Usage Price per second + Inference Duration Price per second) for your selected memory size.

Best,

Didier

AWS
EXPERT
answered 7 months ago
