Deploy LLM serverlessly on SageMaker


Is it possible (and efficient) to deploy an LLM serverlessly using SageMaker? I'm concerned about the performance and costs involved. The ML application doesn't receive a lot of requests.

asked 10 months ago · 1917 views
2 Answers

Hi, from what you describe, SageMaker Serverless Inference seems to be the right solution for your use case.

See this blog announcing the Preview: https://aws.amazon.com/blogs/machine-learning/deploying-ml-models-using-sagemaker-serverless-inference-preview/

SageMaker will automatically provision, scale, and terminate compute capacity based on the inference request volume. SageMaker Serverless Inference also means that you only pay for the duration of running the inference code and the amount of data processed, not for idle time. Moreover, you can scale to zero to optimize your inference costs.

Serverless Inference is a great choice for customers that have intermittent or unpredictable prediction traffic.

It is now GA, and the service documentation is here: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
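
For illustration, here's a minimal sketch of what deploying to a serverless endpoint looks like with the SageMaker Python SDK. The container image, S3 path, and endpoint name are placeholders you'd replace with your own:

import sagemaker
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder container image and model artifact -- replace with your own.
model = Model(
    image_uri="<your-inference-container-image-uri>",
    model_data="s3://my-bucket/my-model/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Serverless endpoint: you pay per request duration and data processed,
# and the endpoint scales to zero when idle.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,  # allowed values range from 1024 to 6144 MB
    max_concurrency=5,
)

predictor = model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="my-serverless-endpoint",  # placeholder name
)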

Hope it helps!

Didier

AWS
EXPERT
answered 10 months ago

Hi Xuan,

Agree with Didier above that serverless is a good option, given your intermittent, low-volume workload.

Having said that, SageMaker serverless endpoints don't support GPUs or large memory sizes (6 GB is the maximum you can request through Service Quotas). Most (if not all) LLMs would be too large, and depending on your latency requirements you may need GPU performance.

Without knowing your full requirements, I'd be looking at alternative ways to host your LLM if always-on real-time endpoints are ruled out due to cost.
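
One such alternative (purely as an illustration, not the only option) is SageMaker Asynchronous Inference, which supports GPU instance types and queues requests through S3, so cold starts and long inference times are tolerable. A minimal sketch, assuming you already have a packaged model artifact and a serving container; the image URI, S3 paths, role ARN, and instance type are placeholders:

from sagemaker.model import Model
from sagemaker.async_inference import AsyncInferenceConfig

# Placeholder image, artifact, and role -- replace with your own.
model = Model(
    image_uri="<llm-serving-container-image-uri>",
    model_data="s3://my-bucket/llm/model.tar.gz",
    role="<your-sagemaker-execution-role-arn>",
)

# Async inference writes responses to S3 instead of returning them
# synchronously, and GPU instance types are allowed (unlike serverless).
async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-outputs/",
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",  # example GPU instance
    initial_instance_count=1,
    async_inference_config=async_config,
)

You can then attach an Application Auto Scaling policy that scales the endpoint down to zero instances when the request backlog is empty, so you aren't paying for an idle GPU between bursts of traffic.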

Cheers, James

answered 10 months ago
  • Hi James,

    In my case, as mentioned previously, the endpoint receives only sporadic traffic (so I don't need a continuously running server), and the model size is well over 20 GB.

    As I understand it, serverless inference can handle such large models, but the latency would be high. Is that right? At this stage, latency isn't our priority and can be ignored. But could you please suggest a solution for this use case with latency taken into account?

    On top of that, could you explain more about the role of memory? If the model is 30 GB in size, does the 6 GB memory limit mean the model can't be hosted?

    Thanks
