
Is GPU serverless inferencing available for custom LLM models?


Wanted to check whether AWS supports GPU inferencing via serverless compute (dynamic loading), since I don't want to spend $1.50/hour on an EC2 instance that my client will use for no more than 5 minutes per month.

3 Answers

Serverless GPU inference is not supported in SageMaker, since serverless inference is built on Lambda technology, which currently doesn't support GPUs. As an alternative, you can import custom models into Amazon Bedrock, where they are served in a serverless way. Note, however, that Custom Model Import currently supports only the Flan, Llama, and Mistral model families.

https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html
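If you go this route, the import itself is a one-time job that points Bedrock at your model artifacts in S3, after which the model is invoked on demand with no provisioned instance. A minimal sketch with boto3; the job name, model name, role ARN, and S3 URI below are hypothetical placeholders:

```python
import boto3

# Control-plane client: Custom Model Import is a "bedrock"
# (not "bedrock-runtime") operation.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Import custom model weights from S3. The architecture must belong to
# a supported family (Flan, Llama, Mistral). All names and ARNs here
# are placeholders to replace with your own.
job = bedrock.create_model_import_job(
    jobName="my-llama-import-job",
    importedModelName="my-custom-llama",
    roleArn="arn:aws:iam::123456789012:role/BedrockModelImportRole",
    modelDataSource={"s3DataSource": {"s3Uri": "s3://my-bucket/model-artifacts/"}},
)
print(job["jobArn"])
```

Once the job completes, you invoke the imported model through the bedrock-runtime InvokeModel API, using the imported model's ARN as the model ID, with on-demand pricing rather than a per-hour instance charge.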

AWS
answered 15 days ago
EXPERT
reviewed 15 days ago

GPU is not supported for SageMaker serverless endpoints.

However, if your customer can tolerate a cold start when scaling up from zero, you can use SageMaker asynchronous inference (see https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) and scale the instance count to 0 when the endpoint is not in use. There is more management involved here, but nothing too complex. You could even implement a parking-lot style of approach: if your customer knows when they will need the endpoint, scale the async endpoint up at that time, run it for the specified period, then scale it back down.
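The scale-to-zero part is handled with Application Auto Scaling on the endpoint's production variant. A rough sketch of the documented pattern, assuming an already-deployed async endpoint (the endpoint name and policy name are placeholders):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "my-async-endpoint"  # placeholder
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Allow the variant to scale all the way down to zero instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)

# Track the per-instance backlog of queued async requests; when the
# queue stays empty past the cooldown, the endpoint scales in to zero.
autoscaling.put_scaling_policy(
    PolicyName="backlog-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

To wake the endpoint from zero when a request arrives, the async inference docs additionally recommend a step-scaling policy on the HasBacklogWithoutCapacity metric; requests received while at zero simply wait in the queue until an instance is up.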

answered 15 days ago
EXPERT
reviewed 15 days ago

AWS does offer serverless inferencing options, but there are some important details to consider for your specific use case:

Amazon SageMaker provides serverless inference capabilities that allow you to deploy machine learning models without having to manage the underlying infrastructure. (The often-quoted figures of thousands of models per endpoint and millions of transactions per second with sub-10 millisecond overhead latencies describe SageMaker's instance-backed real-time endpoints, not the serverless option.)

However, it's important to note that while SageMaker offers GPU instances for inference, including Amazon EC2 G4dn instances, the serverless inference option does not support GPU-backed capacity.

For your specific requirement of very infrequent usage (about 5 minutes per month), serverless inference could be a cost-effective alternative to a dedicated EC2 GPU instance, since you pay only for the compute capacity used while requests are processed.
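For reference, this is roughly what a serverless deployment looks like with boto3. Note that ServerlessConfig exposes only memory and concurrency knobs, which is why there is no GPU choice; all names are placeholders, and the model is assumed to already be registered in SageMaker:

```python
import boto3

sm = boto3.client("sagemaker")

# Serverless endpoints are sized by memory (1024-6144 MB) and max
# concurrency only; there is no instance type, hence no GPU option.
sm.create_endpoint_config(
    EndpointConfigName="my-serverless-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-registered-model",  # placeholder
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 2,
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-serverless-endpoint",
    EndpointConfigName="my-serverless-config",
)
```

Billing is for the compute time used while a request is processed, which fits the roughly-5-minutes-per-month pattern well, provided a CPU-backed allocation of at most 6 GB of memory is enough for the model, which is unlikely for larger LLMs.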

That said, exact pricing and GPU availability for custom LLM models under serverless inference depend on your model's requirements and AWS's current offerings. Check the latest AWS documentation or contact AWS Support for the most up-to-date information on GPU-enabled serverless inference options, as the cloud services landscape is continually evolving.
Sources
Machine Learning Inference - Amazon SageMaker Model Deployment - AWS
Announcing the General Availability of Amazon EC2 G4dn Bare Metal Instances - GPU instances with up to 8 NVIDIA T4 GPUs

answered 15 days ago
EXPERT
reviewed 15 days ago
