AWS SageMaker Real-Time Inference: scaling down to 0 instances

Hello, we would like to use AWS SageMaker to run our AI models, but real-time inference endpoints cannot scale down to 0 instances. This is a serious problem for us, because we need to duplicate the infrastructure across our environments (develop, staging, production) and across multiple regions, which is not affordable. Is there a specific reason why scaling to 0 isn't possible, and can we expect this to change soon? What solutions would you suggest? We were considering the following:

  1. Using Kubernetes + Triton (similar to this blog). The main issue is the complexity of the system.
  2. Using SageMaker Asynchronous Inference. The issue is that we're not sure of the impact on speed, latency, etc., and making the calls asynchronous adds complexity.

Thank you!

1 Answer

Hi,

Why don't you try using SageMaker Serverless Inference instead? It is fully serverless, so you pay only while the endpoint is processing inference requests.

See https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html

Wouldn't that be a better solution for your use case?
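
As a rough illustration of what a serverless endpoint configuration could look like, here is a minimal sketch that only builds the request for the SageMaker `CreateEndpointConfig` API. The helper `serverless_endpoint_config` and the names `demo-config` and `demo-model` are hypothetical, and the list of allowed memory sizes is an assumption that should be checked against the documentation linked above.

```python
from typing import Any, Dict

# Memory sizes accepted by Serverless Inference at the time of writing
# (assumption: verify against the current SageMaker documentation).
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_endpoint_config(
    config_name: str,
    model_name: str,
    memory_mb: int = 2048,
    max_concurrency: int = 5,
) -> Dict[str, Any]:
    """Build keyword arguments for a serverless CreateEndpointConfig request.

    Hypothetical helper: it makes no AWS call, it only assembles the payload.
    """
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                # ServerlessConfig replaces InstanceType / InitialInstanceCount,
                # so no instances are provisioned while the endpoint is idle.
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }


# Placeholder names; the model must already exist in SageMaker.
request = serverless_endpoint_config("demo-config", "demo-model")
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("sagemaker").create_endpoint_config(**request)`, followed by `create_endpoint` with the same `EndpointConfigName`.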

Best,

Didier

AWS
Expert
answered 6 months ago
  • Hello Didier,

    Thank you for your answer. I have a few questions regarding SageMaker Serverless Inference:

    1. Does it support multiple models under one endpoint?
    2. Do the underlying instances have accelerated computing possibilities?

    Thank you for your help!
