How do I determine Sagemaker serverless inference usage?

The pricing example for On-demand Serverless Inference explains how compute charges are calculated, with total inference duration being one of the factors. But how do I determine this duration for requests to the model I have deployed? Is there a metric or log entry in CloudWatch that I can look at to see the inference durations for my endpoint?

Asked 8 months ago · 409 views
2 Answers

If I understood your question correctly, I think you will find the answer in [1]. To help you debug your endpoints, see [2], and to monitor a serverless endpoint specifically, see [3].

Resources:

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html
[2] https://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html
[3] https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-monitoring.html
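To make this concrete: SageMaker publishes a per-request `ModelLatency` metric (in microseconds) to CloudWatch under the `AWS/SageMaker` namespace, which is one way to see the inference durations the question asks about. A minimal sketch of the metric query parameters, assuming a hypothetical endpoint name `my-serverless-endpoint` and the default variant name `AllTraffic` (your variant name may differ):

```python
import datetime

def model_latency_request(endpoint_name, hours=24):
    """Build GetMetricStatistics parameters for the ModelLatency metric,
    which SageMaker reports in microseconds per request.
    Pass the result to boto3.client("cloudwatch").get_metric_statistics(**params).
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            # "AllTraffic" is the usual default variant name; adjust if yours differs.
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "StartTime": now - datetime.timedelta(hours=hours),
        "EndTime": now,
        "Period": 3600,                       # one datapoint per hour
        "Statistics": ["Average", "Maximum"],
    }

params = model_latency_request("my-serverless-endpoint")
```

Each returned datapoint's `Average` divided by 1,000,000 gives the average billed duration per request in seconds for that hour.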

AWS
Answered 8 months ago

Hi,

The SageMaker Serverless Inference pricing is detailed at https://aws.amazon.com/sagemaker/pricing/

Amazon SageMaker Serverless Inference

Amazon SageMaker Serverless Inference enables you to deploy machine learning models 
for inference without configuring or managing any of the underlying infrastructure. 
You can either use on-demand Serverless Inference or add Provisioned Concurrency to your 
endpoint for predictable performance.

With on-demand Serverless Inference, you only pay for the compute capacity used to process
inference requests, billed by the millisecond, and the amount of data processed. The compute
charge depends on the memory configuration you choose.

This means the cost model has 1 parameter and 4 input variables:

  1. parameter: the memory size you select, which depends on the size of your model -> cost/sec
  2. var1: number of inferences in a period, defined by your business case
  3. var2: avg duration of an inference (coming from measurements in your initial tests)
  4. var3: avg size of your prompt (if your model is an LLM)
  5. var4: avg size of the prompt completion.

So, cost will be cost/sec [for the chosen memory size] x (var1 x var2) + data-processing price per GB x (var3 + var4)

The parameter gives you the price per second of inference; var1 x var2 gives you the total duration of inferences; var3 + var4 give you the amount of data processed.

If you go with the more advanced Provisioned Concurrency, you have to replace cost/sec with (Provisioned Concurrency Usage Price per second + Inference Duration Price per second) for your selected memory size.
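The formula above can be sketched as a small estimator. All numbers below are hypothetical placeholders (the compute price per second and data price per GB come from the pricing page for your memory size and region), not actual SageMaker rates:

```python
def serverless_cost(price_per_sec, n_requests, avg_duration_s,
                    data_price_per_gb=0.0, avg_data_gb=0.0):
    """Estimate on-demand Serverless Inference cost for a period.

    price_per_sec:     compute price for the chosen memory size (the parameter)
    n_requests:        var1 - number of inferences in the period
    avg_duration_s:    var2 - average inference duration in seconds
    data_price_per_gb: price per GB of data processed
    avg_data_gb:       var3 + var4 - average data in + out per request, in GB
    """
    compute = price_per_sec * n_requests * avg_duration_s   # total billed duration
    data = data_price_per_gb * n_requests * avg_data_gb     # total data processed
    return compute + data

# Hypothetical example: 100,000 requests/month at 120 ms each,
# at an assumed $0.00002 per second of compute.
monthly = serverless_cost(0.00002, 100_000, 0.120)  # ~ $0.24
```

For Provisioned Concurrency, pass the sum of the Provisioned Concurrency Usage price and the Inference Duration price as `price_per_sec`, as described above.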

Best,

Didier

AWS
Expert
Answered 8 months ago
