What are the cost-effective options for an on-demand API serving a fine-tuned LLM on GPU?

Hello, I'm looking for the most cost-effective option for inference on a fine-tuned Llama 3.1 8B Instruct model through an API endpoint. Considering that:

  • SageMaker Serverless Inference would be perfect, but it does not support GPUs.
  • SageMaker endpoints charge per hour for as long as they are in service.
  • Bedrock Provisioned Throughput charges per hour for as long as it is in service.

I could deploy the endpoint, run the inference, and delete it, but that would hardly be practical, if viable at all, with a 12 GB model.

Appreciate your thoughts and please correct any misconceptions I might have.

3 Answers

1. Spot Instances or Auto-scaling with SageMaker

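Spot Instances: Spot capacity is much cheaper than on-demand capacity and, although it can be interrupted, it can work for batch processing or scheduled inference jobs. Note that within SageMaker, spot capacity is offered for training jobs (Managed Spot Training) rather than for hosted real-time endpoints, so for inference it is more readily obtained through self-managed EC2 (see option 3).

Auto-scaling Endpoints: If your inference requests are infrequent, you can configure auto-scaling on SageMaker endpoints so that capacity shrinks when idle, which saves costs during periods of inactivity. Whether an endpoint can scale all the way to zero depends on the endpoint type: it is supported for Asynchronous Inference endpoints, whereas a conventional real-time endpoint variant has historically required at least one running instance.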

2. Lambda with Elastic Inference (EI)

Elastic Inference: Lambda does not support GPUs. Amazon Elastic Inference (EI) was designed to attach low-cost inference accelerators to EC2 or SageMaker instances instead of provisioning full GPUs. However, AWS has stopped onboarding new customers to EI, so verify that it is available to your account, and that its accelerator memory fits your model, before planning around it.

3. Run on GPU-Backed EC2 Instances (with Spot Pricing)

EC2 with Spot Pricing: Instead of using SageMaker endpoints, you could manage the deployment yourself using EC2 spot instances with GPU support (such as the g4dn family). Combine this with a strategy of launching the EC2 instance only when needed, running inference, and then stopping or terminating it. With proper automation, this can be a more cost-efficient solution for workloads that don’t require always-on availability.
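Whether the start/stop strategy pays off can be estimated with simple arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the hourly rate, job counts, and startup overhead are hypothetical placeholders, not quoted AWS prices (spot prices vary by region and over time).

```python
def always_on_monthly_cost(hourly_rate: float, hours_per_month: float = 730.0) -> float:
    """Cost of keeping one instance in service for the whole month."""
    return hourly_rate * hours_per_month


def start_stop_monthly_cost(
    hourly_rate: float,
    jobs_per_month: int,
    minutes_per_job: float,
    startup_minutes: float,
) -> float:
    """Cost when the instance is started for each job and stopped afterwards.

    Every job is billed for its own runtime plus the startup overhead
    (boot, pulling the container, loading the model weights).
    """
    billed_hours = jobs_per_month * (minutes_per_job + startup_minutes) / 60.0
    return hourly_rate * billed_hours


if __name__ == "__main__":
    rate = 0.50  # hypothetical USD per hour, NOT a quoted AWS price
    print("always on :", always_on_monthly_cost(rate))
    print("start/stop:", start_stop_monthly_cost(
        rate, jobs_per_month=60, minutes_per_job=10, startup_minutes=5))
```

With a 12 GB model the startup overhead is the term to watch: the more jobs per month, the more that overhead is paid again, so the advantage over an always-on instance narrows as traffic grows.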

4. Hugging Face Inference API (for LLaMA Models)

Hugging Face Inference Endpoint: The Hugging Face platform provides managed inference API endpoints with support for models like LLaMA. They have usage-based pricing, so you only pay for what you use, which can be more cost-efficient than hourly pricing if your usage is sporadic. Hugging Face also offers the “Accelerated Inference API”, which uses GPUs and is optimized for high-performance inference at competitive pricing.

5. NVIDIA Triton Inference Server

Self-hosting with Triton: If you want more control over your infrastructure and deployment, consider using NVIDIA Triton Inference Server. It is highly optimized for inference on GPUs and supports models like LLaMA. You could deploy this on a GPU-backed EC2 instance, Kubernetes cluster, or even on-prem hardware. Triton allows batching of inference requests, which can further reduce costs if your workload allows batching.
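To illustrate the batching idea in general terms, here is a minimal client-side sketch that groups prompts into fixed-size batches before they are sent to a model server. It is not Triton's own mechanism (Triton's dynamic batching is configured on the server side); the function name and the batch size are assumptions chosen for this example.

```python
from typing import Iterator, List, Sequence


def make_batches(prompts: Sequence[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_batch_size prompts."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        # The last batch may be smaller than max_batch_size.
        yield list(prompts[start:start + max_batch_size])


if __name__ == "__main__":
    pending = ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5"]
    for batch in make_batches(pending, max_batch_size=2):
        # Each batch would be sent to the inference server as one request.
        print(len(batch), batch)
```

Each batch then costs one GPU forward pass sequence instead of several, at the price of waiting until enough requests have accumulated.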

6. Serverless Inference with GCP Vertex AI

Google Cloud Vertex AI: Google Cloud's Vertex AI offers managed machine learning, including GPU-backed inference for high-performance models. It can be more cost-effective if its billing matches sporadic traffic, so confirm whether your deployment would be billed per request or for the time the serving nodes stay deployed, since the latter behaves like hourly endpoint pricing.

7. OpenAI or Other Managed LLM Services

Managed Services: If the fine-tuned LLaMA model isn't a hard requirement and you just need the inference capability of a fine-tuned LLM, consider OpenAI or other managed services that provide API access to similar models (e.g., GPT-4). These are usage-based services that can be cheaper for lower traffic workloads compared to deploying your own infrastructure.
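The "cheaper for lower traffic" claim can be made concrete with a break-even estimate: the monthly token volume at which a per-token API costs as much as an always-on GPU instance. This is a rough sketch; the function name is invented for the example, and both prices are hypothetical placeholders rather than any provider's actual rates.

```python
def break_even_tokens_per_month(
    gpu_hourly_rate: float,
    price_per_million_tokens: float,
    hours_per_month: float = 730.0,
) -> float:
    """Monthly token volume at which per-token billing equals an always-on GPU.

    Below this volume the usage-based API is cheaper; above it, the
    dedicated instance is cheaper (ignoring engineering and ops effort).
    """
    monthly_gpu_cost = gpu_hourly_rate * hours_per_month
    return monthly_gpu_cost / price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical prices: 1.00 USD/hour GPU, 2.00 USD per million tokens.
    tokens = break_even_tokens_per_month(1.0, 2.0)
    print(f"break-even at about {tokens:,.0f} tokens per month")
```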

8. Hybrid Model: EC2 Spot + SageMaker Batch

Hybrid Approach: You could combine EC2 Spot Instances for cost-effective GPU inference with SageMaker Batch Transform Jobs for running inference in batch mode (instead of serving it in real-time). Batch jobs allow for better cost control since you're billed only for the resources used during the batch process.
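Because spot capacity can be reclaimed mid-job, a batch run on it benefits from being resumable. The following is a minimal sketch of that idea using a local JSON checkpoint; `run_resumable_batch`, the `infer` callable, and the checkpoint path are all hypothetical names for this example, and a real job would more likely checkpoint to durable storage such as S3.

```python
import json
import os
from typing import Callable, Dict, Sequence


def run_resumable_batch(
    prompts: Sequence[str],
    infer: Callable[[str], str],
    checkpoint_path: str,
) -> Dict[str, str]:
    """Run infer() over prompts, skipping items already in the checkpoint."""
    results: Dict[str, str] = {}
    if os.path.exists(checkpoint_path):
        # A previous (possibly interrupted) run left partial results behind.
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)

    for index, prompt in enumerate(prompts):
        key = str(index)  # JSON object keys are strings
        if key in results:
            continue  # already processed before the interruption
        results[key] = infer(prompt)
        # Write to a temp file, then rename, so an interruption during the
        # write cannot leave a truncated checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

If the instance is interrupted, rerunning the same call on a replacement instance with the same checkpoint file picks up at the first unprocessed prompt, so only the in-flight item is repeated.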

Optimizations to Consider:

Inference Batching: Grouping multiple requests together (batching) can greatly reduce the cost per request on any platform, because a GPU is used more efficiently when it processes several inputs in one pass.

On-Demand vs. Reserved: If you anticipate consistent usage, committing to capacity (Reserved Instances or Savings Plans in AWS, Committed Use in GCP) may yield savings.

Dynamic Endpoint Scaling: For models that need to remain online, investigate auto-scaling mechanisms that reduce resource use during low-traffic times.

Conclusion:

For on-demand, cost-effective inference of your fine-tuned LLaMA model:

  • EC2 spot instances with GPUs are a strong option for cost-efficiency, especially if you can automate starting and stopping the instance.
  • Hugging Face offers a more convenient API-based option with usage-based billing.
  • Vertex AI (managed cloud) and NVIDIA Triton (self-hosted) are also good choices, depending on which approach you prefer.

EXPERT
answered 2 years ago

Hi,

Cost-wise, the optimal solution is Amazon SageMaker Serverless Inference. For all details, see https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html

See the final sentence of the excerpt below:

Amazon SageMaker Serverless Inference is a purpose-built inference option that enables you to deploy and scale ML models without configuring or managing any of the underlying infrastructure. On-demand Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling. With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern.

During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs.

Best,

Didier

EXPERT
answered 2 years ago
AWS
answered 2 years ago
