Hi, from what you describe, SageMaker Serverless Inference seems to be the right solution for your use case.
See this blog announcing the Preview: https://aws.amazon.com/blogs/machine-learning/deploying-ml-models-using-sagemaker-serverless-inference-preview/
SageMaker will automatically provision, scale, and terminate compute capacity based on the inference request volume. SageMaker Serverless Inference also means that you only pay for the duration of running the inference code and the amount of data processed, not for idle time. Moreover, you can scale to zero to optimize your inference costs. Serverless Inference is a great choice for customers that have intermittent or unpredictable prediction traffic.
It is now generally available (GA), and the service documentation is here: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
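For illustration, here is a minimal sketch of deploying to a serverless endpoint with the SageMaker Python SDK; the container image, model artifact location, and role below are placeholders, not values from this thread:

```python
# Minimal sketch: deploy a model to a SageMaker serverless endpoint.
# Replace the image URI, model artifact path, and role with your own values.
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-container-image-uri>",    # e.g. a prebuilt framework image
    model_data="s3://my-bucket/model/model.tar.gz",  # placeholder artifact location
    role="<sagemaker-execution-role-arn>",
)

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # serverless memory, currently capped at 6144 MB
        max_concurrency=5,       # concurrent invocations before throttling
    )
)

print(predictor.endpoint_name)
```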
Hope it helps!
Didier
Hi Xuan
Agree with Didier above that serverless is a good option given your intermittent, low-volume workload.
Having said that, SageMaker serverless endpoints don't support GPUs or large memory sizes (6GB is the maximum you can request through Service Quotas). Most (if not all) LLMs would be too large, and depending on latency you may require GPU performance.
Without knowing your full requirements, I'd be looking at alternative ways to host your LLM if always-on real-time endpoints are ruled out due to cost.
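For example, one option worth evaluating (my assumption, since your full requirements aren't known) is SageMaker Asynchronous Inference: it runs on GPU instance types and can be auto-scaled down to zero between bursts, at the cost of queued rather than real-time responses. A minimal sketch with the SageMaker Python SDK; the image, artifact, role, and instance type are placeholders:

```python
# Rough sketch: deploy to an asynchronous inference endpoint on a GPU instance.
# Requests are queued and results are written to the S3 output path.
from sagemaker.model import Model
from sagemaker.async_inference import AsyncInferenceConfig

model = Model(
    image_uri="<llm-inference-container-image-uri>",
    model_data="s3://my-bucket/llm/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # GPU instance; pick one the model fits on
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-results/",  # where responses are written
    ),
)
```

Scaling the async endpoint down to zero instances is configured separately through Application Auto Scaling on the endpoint.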
Cheers James
Hi James,
In my case, as mentioned previously, the endpoint receives infrequent, sporadic traffic (so I don't need a continuously running server), and the model size is well over 20GB.
As I understand it, Serverless Inference can handle such large models, but the latency would be high. Is that right? At this stage, latency isn't our priority and can be ignored, but could you also suggest a solution for this use case with latency taken into account?
On top of that, could you explain more about the role of the memory setting? If the model is 30GB in size, does the 6GB memory limit mean it can't be hosted?
Thanks
Please consider evaluating the Amazon Bedrock "Import custom models" functionality (currently in preview) for this use case; it was launched earlier this year.
Info in this blog post: https://aws.amazon.com/blogs/aws/import-custom-models-in-amazon-bedrock-preview/
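Once a model has been imported, it can be invoked like any other Bedrock model through the runtime API. A minimal sketch with boto3; the imported-model ARN and request body below are placeholders and depend on the model you import:

```python
# Hypothetical sketch: invoke a model previously imported into Amazon Bedrock.
import json
import boto3

runtime = boto3.client("bedrock-runtime")

response = runtime.invoke_model(
    modelId="arn:aws:bedrock:us-east-1:111122223333:imported-model/placeholder",
    body=json.dumps({"prompt": "Hello", "max_gen_len": 128}),  # body format depends on the model
    contentType="application/json",
    accept="application/json",
)

print(json.loads(response["body"].read()))
```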
Hi Didier,
My model size is over 20GB, but the maximum memory available for Serverless Inference is only 6GB, far lower than that (https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html). Would it work?