SageMaker Serverless Inference does not support GPUs, since it is built on AWS Lambda technology, which currently has no GPU support. As an alternative, you can host custom models on Amazon Bedrock via Custom Model Import, and they will be served in a serverless way. Note, however, that at the time of writing only the Flan, Llama, and Mistral model families are supported.
https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html
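If you go the Bedrock route, the import is kicked off with a Custom Model Import job. Here is a minimal sketch using boto3; the job name, model name, role ARN, and S3 URI are all placeholders for your own resources:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Start a Custom Model Import job pointing at model artifacts in S3
# (e.g., a fine-tuned Llama checkpoint).
response = bedrock.create_model_import_job(
    jobName="my-llm-import-job",        # placeholder
    importedModelName="my-custom-llm",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockImportRole",  # placeholder
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://my-model-bucket/llama-finetuned/"}
    },
)
print(response["jobArn"])
```

Once the job finishes, you invoke the imported model through the bedrock-runtime client's invoke_model call, passing the imported model's ARN as the modelId; as of this writing the model scales to zero after a period of inactivity, so you are not billed while it sits idle.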
GPU instances are not supported for SageMaker serverless endpoints.
However, if your customer can tolerate the delay while capacity spins up, you can use SageMaker Asynchronous Inference (see https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) and scale the instance count to 0 when the endpoint is not in use, as sketched below. There is more management involved here, but nothing too complex. You can even implement a parking-lot-style approach: if your customer knows when they will need the endpoint, you can schedule the async endpoint to scale out at that time, run for a specified period, and then scale back down.
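Here is a rough sketch of the scale-to-zero setup using Application Auto Scaling via boto3; the endpoint and variant names are placeholders:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder endpoint/variant names; substitute your async endpoint.
resource_id = "endpoint/my-async-endpoint/variant/AllTraffic"

# Allow the variant to scale all the way down to 0 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)

# Track the request backlog: scale out when requests are queued,
# scale back in to 0 once the queue has been empty long enough.
autoscaling.put_scaling_policy(
    PolicyName="async-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

Two notes: the async inference docs also describe adding a step-scaling policy on the HasBacklogWithoutCapacity metric, since the per-instance backlog metric is not emitted while the endpoint sits at zero instances; and for the parking-lot approach, put_scheduled_action with a cron schedule can raise MinCapacity ahead of a known usage window and drop it back afterwards.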
AWS offers serverless inference options, but there are some important details to consider for your specific use case, particularly around GPU availability:
Amazon SageMaker provides a serverless inference option that automatically provisions and scales compute capacity based on traffic. This lets you deploy machine learning models without managing the underlying infrastructure, and you pay only while requests are being processed.
However, it's important to note that while SageMaker offers GPU instances for inference, such as the Amazon EC2 G4dn family with NVIDIA T4 GPUs, Serverless Inference does not support GPU-backed capacity; GPUs are available only on real-time and asynchronous endpoints.
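For reference, a serverless endpoint is sized by memory and concurrency rather than by instance type, which is why there is no GPU option. A minimal configuration sketch (the model name is a placeholder for an existing SageMaker model):

```python
import boto3

sm = boto3.client("sagemaker")

# Serverless variants specify memory (1024-6144 MB) and max concurrency
# instead of an instance type; there is no GPU choice here.
sm.create_endpoint_config(
    EndpointConfigName="my-serverless-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",  # placeholder: an existing SageMaker model
            "ServerlessConfig": {"MemorySizeInMB": 6144, "MaxConcurrency": 5},
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-serverless-endpoint",
    EndpointConfigName="my-serverless-config",
)
```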
For your specific requirement of very infrequent usage (only about 5 minutes per month), a pay-per-use model could be far more cost-effective than running a dedicated EC2 GPU instance around the clock, since you pay only for the compute capacity you actually consume.
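To make the cost intuition concrete, here is a back-of-the-envelope comparison. The hourly rate below is an illustrative assumption (roughly what a ml.g4dn.xlarge hosting instance has cost in us-east-1; check current pricing), and real bills also include storage, data transfer, and any minimum billing granularity:

```python
hourly_rate = 0.74  # assumed illustrative GPU instance price, $/hour

always_on = hourly_rate * 24 * 30     # dedicated instance, all month
pay_per_use = hourly_rate * (5 / 60)  # ~5 minutes of actual usage

print(f"Always-on:   ${always_on:,.2f}/month")   # ~ $532.80
print(f"Pay-per-use: ${pay_per_use:.2f}/month")  # ~ $0.06
```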
That said, exact pricing and GPU availability for custom LLM models will depend on your model's requirements and AWS's current offerings. Consult the latest AWS documentation or reach out to AWS Support for the most up-to-date information on GPU-enabled inference options, as the cloud services landscape is continually evolving.
Sources
- Machine Learning Inference - Amazon SageMaker Model Deployment - AWS
- Announcing the General Availability of Amazon EC2 G4dn Bare Metal Instances - GPU instances with up to 8 NVIDIA T4 GPUs