Deploy an LLM serverlessly on SageMaker


Is it possible (and efficient) to deploy an LLM serverlessly using SageMaker? I'm concerned about the performance and costs involved. The ML application doesn't receive a lot of requests.

Asked 10 months ago · 2000 views
2 Answers

Hi, from what you describe, SageMaker Serverless Inference seems to be the right solution for your use case.

See this blog announcing the Preview: https://aws.amazon.com/blogs/machine-learning/deploying-ml-models-using-sagemaker-serverless-inference-preview/

SageMaker will automatically provision, scale, and terminate compute capacity based on the inference request volume. SageMaker Serverless Inference also means that you only pay for the duration of running the inference code and the amount of data processed, not for idle time. Moreover, you can scale to zero to optimize your inference costs.

Serverless Inference is a great choice for customers that have intermittent or unpredictable prediction traffic.

It is now GA, and the service documentation is here: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
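
For illustration, here is a minimal sketch of deploying a model to a serverless endpoint with the SageMaker Python SDK. The model artifact, IAM role, and framework versions below are placeholders, not values from your setup:

```python
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

# Placeholder model artifact and execution role -- substitute your own.
model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    transformers_version="4.26",   # example framework versions; pick a
    pytorch_version="1.13",        # combination supported by the HF containers
    py_version="py39",
)

# Serverless configuration: memory size (currently up to 6144 MB) and max concurrency.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=5,
)

# Deploy: SageMaker provisions capacity on demand and scales to zero when idle.
predictor = model.deploy(serverless_inference_config=serverless_config)

print(predictor.predict({"inputs": "Hello, world"}))
```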

Hope it helps!

Didier

AWS
Expert
Answered 10 months ago

Hi Xuan,

I agree with Didier above that serverless is a good option given your intermittent, low-volume workload.

Having said that, SageMaker serverless endpoints don't support GPUs or large memory sizes (6 GB is the maximum you can request through Service Quotas). Most (if not all) LLMs would be too large, and depending on your latency requirements you may need GPU performance.

Without knowing your full requirements, I'd be looking at alternative ways to host your LLM if always-on realtime endpoints are ruled out due to cost.
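
For comparison, an always-on real-time GPU endpoint would look roughly like the sketch below, assuming a `model` object built as in the serverless sketch in the previous answer; the instance type is only an example, not a sizing recommendation:

```python
# Assuming `model` is a SageMaker Model object (e.g., built as in the
# serverless sketch in the previous answer). An always-on real-time
# endpoint keeps the instance running, so you pay per instance-hour
# even when there is no traffic.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # example GPU instance type
)
```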

Cheers, James

Answered 10 months ago
  • Hi James,

    In my case, as mentioned previously, the endpoint receives only sporadic traffic (so I don't need a continuously running server), and the model size is well over 20 GB.

    As I understand it, serverless inference can handle such large models, but the latency would be high. Is that right? At this stage, latency isn't our priority and can be ignored, but could you also suggest a solution for this use case with latency taken into account?

    On top of that, could you explain more about the role of the memory setting? If the model is 30 GB in size, does the 6 GB memory limit mean it can't be hosted?

    Thanks
