Yes, it is possible to host Large Language Models (LLMs) in a serverless manner where you are charged only when there are requests rather than continuously. There are a few options available on AWS that provide this capability:
- Amazon Bedrock: This is a fully managed service that gives you access to foundation models, including LLMs, via an API. With Bedrock's serverless experience, you can integrate and deploy LLMs into your applications without managing any infrastructure. With on-demand pricing, you pay per input and output token for each API call, and nothing when you are not invoking a model (a minimal invocation sketch appears after this list).
- Amazon SageMaker Serverless Inference: This option lets you deploy your own models without managing the underlying infrastructure. The endpoint automatically scales down to zero when there is no traffic, so you only pay for the compute capacity used during inference. However, note that GPU support for SageMaker Serverless Inference is not currently available (see the combined SageMaker sketch below).
- Amazon SageMaker Asynchronous Inference: While not strictly serverless, this option allows you to scale your endpoint instances down to zero when not in use. Requests that arrive while there are zero instances are queued and processed once the endpoint scales back up. This can be a good alternative if you need GPU support or have longer processing times (also covered in the sketch below).
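To make the pay-per-request model concrete, here is a minimal sketch of invoking a Bedrock model with boto3's Converse API. The Region and model ID below are placeholders; you would need to pick a model that is enabled in your account and available in your Region.

```python
import boto3

# The Region and model ID are examples -- verify availability and model
# access for your account before using them.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize the serverless options for hosting LLMs on AWS."}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.5},
)

# Billing is per input/output token for this call; there is no provisioned
# infrastructure incurring charges between requests.
print(response["output"]["message"]["content"][0]["text"])
```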
These options are more cost-effective for infrequent or unpredictable usage patterns, because you are not charged when there are no requests. The choice between them depends on your specific requirements, such as the need for GPU support, the specific LLM you want to use, and your application's latency requirements.
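For the two SageMaker options, the sketch below shows roughly how each would be configured with the SageMaker Python SDK. The container image, model artifact, IAM role, and instance type are hypothetical placeholders, not a tested deployment; treat this as a starting point under those assumptions.

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.async_inference import AsyncInferenceConfig

session = sagemaker.Session()

# Hypothetical serving image, model artifact, and IAM role -- replace with
# your own LLM inference container and artifacts.
model = Model(
    image_uri="<your-inference-container-image-uri>",
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    sagemaker_session=session,
)

# Option A: Serverless Inference -- no instances to manage, scales to zero,
# but currently CPU only, so it suits smaller models.
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,
        max_concurrency=5,
    )
)

# Option B (an alternative to A, deployed as its own endpoint):
# Asynchronous Inference -- GPU instance types are supported, requests are
# queued via S3, and an Application Auto Scaling policy can let the
# instance count fall to zero when the queue is empty.
# predictor = model.deploy(
#     instance_type="ml.g5.xlarge",
#     initial_instance_count=1,
#     async_inference_config=AsyncInferenceConfig(
#         output_path="s3://my-bucket/async-output/",  # hypothetical bucket
#     ),
# )
```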
Sources
- Designing Serverless Integration Patterns for Large Language Models (LLMs) | AWS Compute Blog
- Vitech uses Amazon Bedrock to revolutionize information access with AI-powered chatbot | AWS Machine Learning Blog
- Is GPU Serverless inferencing for custom LLM models? | AWS re:Post
- Model Hosting FAQs - Amazon SageMaker AI
Amazon Bedrock would be the best and easiest option for serverless inference where you are only charged when you interact with the LLM. Pricing is generally based on input and output tokens per model.
You will need, though, to expand further on your requirements. Bedrock supports foundation models (FMs) from multiple providers; the Bedrock documentation lists which ones. You will need to verify that the one you want to use is there.
You may also want to validate which Regions support the LLM of your choice, as well as any additional features you are interested in.
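As a quick way to check this, you can list the foundation models Bedrock exposes in a given Region with boto3 (a minimal sketch; it assumes your AWS credentials are already configured):

```python
import boto3

# Model availability differs by Region, so point the client at the Region
# you actually plan to deploy in.
bedrock = boto3.client("bedrock", region_name="us-east-1")

for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["providerName"], "-", model["modelId"])
```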
If Bedrock doesn't fit your use case, then SageMaker Serverless Inference is another option you can evaluate, as mentioned by the agent.