SageMaker Real-Time Inference pricing clarification


Hello, can you please help me understand the pricing of SageMaker Real-Time Inference? Suppose I choose the ml.c7g.16xlarge instance (128 GiB memory, 64 vCPUs) with the following expected load: each request has a 2 MB input payload and a 2 MB output payload, and I expect 2 million requests per month. The deployed model is 2 GB and the endpoint runs 24 hours a day, 7 days a week. If auto scaling is not turned on, what would the monthly charge be? Are there any restrictions on the number of requests?

Real-time inference cost can be broken down into two components:

  1. Per-hour charges for your instance
  2. Data processed in and out, charged per GB

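In your case, you would be charged the per-hour price of the ml.c7g.16xlarge instance for every hour the endpoint is running, plus the data in/out charge on (2 MB + 2 MB) × 2 million requests, which is about 8,000 GB per month. Pricing examples can be found on the Amazon SageMaker pricing page.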
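The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under stated assumptions: the hourly rate and per-GB data rate below are placeholders, not actual AWS prices (look up the current values for your region on the pricing page), a 30-day month is assumed, and a single per-GB rate is assumed to apply to both data in and data out. The function name `estimate_monthly_cost` is made up for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, endpoint up 24x7


def estimate_monthly_cost(hourly_rate_usd, data_rate_usd_per_gb,
                          requests_per_month, input_mb, output_mb,
                          instance_count=1, hours=HOURS_PER_MONTH):
    """Rough monthly estimate: instance hours plus data processed in/out."""
    # Component 1: the instance is billed for every hour it is running.
    instance_cost = hourly_rate_usd * hours * instance_count
    # Component 2: total payload volume, converted from MB to decimal GB.
    data_gb = (input_mb + output_mb) * requests_per_month / 1000
    data_cost = data_gb * data_rate_usd_per_gb
    return {
        "instance_cost": instance_cost,
        "data_gb": data_gb,
        "data_cost": data_cost,
        "total": instance_cost + data_cost,
    }


if __name__ == "__main__":
    # Placeholder rates only; replace with the published prices for your region.
    estimate = estimate_monthly_cost(
        hourly_rate_usd=1.00,        # hypothetical ml.c7g.16xlarge hourly rate
        data_rate_usd_per_gb=0.01,   # hypothetical data in/out rate
        requests_per_month=2_000_000,
        input_mb=2,
        output_mb=2,
    )
    for name, value in estimate.items():
        print(f"{name}: {value:,.2f}")
```

Because auto scaling is off, `instance_count` stays fixed at 1 for the whole month, so the instance component does not change with traffic; only the data component scales with the number of requests.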

If your usage will be consistent over a period of time, do check out Savings Plans to reduce cost.

  • Got it, thank you! Could you also help me understand at what point I would need more than one instance in my case? Would it be when the instance receives more than 64 requests at a time?

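    You can define your own auto scaling rules and apply them to your hosted model. Example rules include invocations per instance and CPU utilization per instance. These rules scale the number of instances up and down with load. The point at which you need a second instance is not fixed by the vCPU count; it depends on how much traffic your model can handle per instance, which is best determined by load testing.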
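As a minimal sketch of an "invocations per instance" rule, the snippet below builds the two requests that Application Auto Scaling expects for a SageMaker endpoint variant: one to register the variant as a scalable target and one to attach a target-tracking policy. The endpoint name, variant name, target value, capacity bounds, cooldowns, and policy name are all illustrative assumptions, and the helper names `build_scaling_requests` and `apply_scaling` are made up for this example. Applying the requests needs boto3 and AWS credentials, so that step is kept in a separate function.

```python
def build_scaling_requests(endpoint_name, variant_name,
                           target_invocations_per_instance,
                           min_instances=1, max_instances=4):
    """Build request parameters for a target-tracking scaling rule."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Bounds on how many instances the variant may scale between.
    target = {**common, "MinCapacity": min_instances,
              "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": "invocations-per-instance-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Average invocations per minute per instance to aim for.
            "TargetValue": float(target_invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType":
                    "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # seconds; illustrative values
            "ScaleOutCooldown": 60,
        },
    }
    return target, policy


def apply_scaling(target, policy):
    """Send both requests to AWS (requires boto3 and credentials)."""
    import boto3  # imported here so the builder works without boto3
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


if __name__ == "__main__":
    # "my-endpoint", "AllTraffic" and the target of 100 are placeholders.
    target, policy = build_scaling_requests("my-endpoint", "AllTraffic", 100)
    print(target)
    print(policy)
```

A sensible target value would come from load testing the model on a single instance rather than from the vCPU count.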


