Deploy an LLM serverlessly on SageMaker


Is it possible (and efficient) to deploy an LLM serverlessly using SageMaker? I'm concerned about the performance and costs involved. The ML application doesn't receive many requests.

Asked 10 months ago · Viewed 2,000 times
2 Answers

Hi, from what you describe, SageMaker Serverless Inference seems to be the right solution for your use case.

See this blog announcing the Preview: https://aws.amazon.com/blogs/machine-learning/deploying-ml-models-using-sagemaker-serverless-inference-preview/

SageMaker will automatically provision, scale, and terminate compute capacity based on the inference request volume. SageMaker Serverless Inference also means that you only pay for the duration of running the inference code and the amount of data processed, not for idle time. Moreover, you can scale to zero to optimize your inference costs.

Serverless Inference is a great choice for customers that have intermittent or unpredictable prediction traffic.

It is now GA and the service documentation is here: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
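
For a rough idea of what this looks like in practice, here is a minimal boto3 sketch of deploying to a serverless endpoint. All names are placeholders, and it assumes a SageMaker Model called my-llm-model has already been created from a container image and a model artifact in S3:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names; assumes a SageMaker Model "my-llm-model" already exists.
sm.create_endpoint_config(
    EndpointConfigName="my-llm-serverless-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-llm-model",
            # Supplying ServerlessConfig (instead of InstanceType /
            # InitialInstanceCount) is what makes the endpoint serverless.
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,  # 1024-6144 MB, in 1 GB increments
                "MaxConcurrency": 5,     # max concurrent invocations
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-llm-serverless-endpoint",
    EndpointConfigName="my-llm-serverless-config",
)

# Invocation works the same way as for a real-time endpoint.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-llm-serverless-endpoint",
    ContentType="application/json",
    Body=b'{"inputs": "Hello"}',
)
print(response["Body"].read())
```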

Hope it helps!

Didier

AWS
EXPERT
Answered 10 months ago

Hi Xuan,

I agree with Didier above that serverless is a good option given your intermittent, low-volume workload.

Having said that, SageMaker serverless endpoints don't support GPUs, and memory is limited (6 GB is the maximum you can request through Service Quotas). Most (if not all) LLMs would be too large, and depending on your latency requirements you may need GPU performance.

Without knowing your full requirements, I'd be looking at alternative ways to host your LLM if always-on real-time endpoints are ruled out due to cost; see the sketch below for the contrast.
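
To make the limit concrete, here is a hedged sketch of that contrast: a serverless variant tops out at 6144 MB of memory with no GPU option, while an always-on real-time variant picks an instance type instead. The names and the ml.g5.2xlarge choice below are examples only, not a sizing recommendation:

```python
import boto3

sm = boto3.client("sagemaker")

# A serverless variant is capped at 6144 MB of memory and cannot use GPUs,
# so a 20+ GB LLM will not fit. A GPU-backed real-time variant (always-on,
# billed per instance-hour) is the kind of alternative referred to above.
# Names and instance type are examples only; sizing depends on the model.
sm.create_endpoint_config(
    EndpointConfigName="my-llm-realtime-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-llm-model",
            "InstanceType": "ml.g5.2xlarge",  # example GPU instance
            "InitialInstanceCount": 1,
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-llm-realtime-endpoint",
    EndpointConfigName="my-llm-realtime-config",
)
```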

Cheers, James

Answered 10 months ago
  • Hi James,

    In my case, as mentioned previously, the endpoint receives sporadic traffic (so I don't need a continuously running server), and the model is (much) larger than 20 GB.

    As I understand it, serverless inference can handle such large models, but the latency would be high. Is that right? At this stage, latency isn't our priority and can be ignored. But could you please also suggest a solution for this use case with latency taken into account?

    On top of that, could you explain more about the role of memory? If the model is 30 GB in size, does the 6 GB memory limit mean the model can't be hosted?

    Thanks
