
Does SageMaker real-time inference benefit from capacity reservations?


I need to deploy a SageMaker endpoint for inference that uses a GPU instance, and at certain times it fails with the error: "Unable to provision requested ML compute capacity due to InsufficientInstanceCapacity error. Please retry using a different ML instance type or after some time." If I create a capacity reservation for the instance I need, will SageMaker be able to consume the reserved instance? Or are capacity reservations an EC2-only construct?

1 Answer

Yes, Amazon SageMaker real-time inference can benefit from capacity reservation through two main approaches:

  1. SageMaker AI's Flexible Training Plans (FTP): This feature now supports inference endpoints, allowing you to reserve specific GPU instance types for planned evaluations and production peaks. With FTP, you can choose your preferred instance types, compute requirements, reservation length, and start date for your inference workload. When creating the endpoint, you reference the reservation ARN, and SageMaker AI automatically provisions and runs the endpoint on that guaranteed capacity for the duration of your plan. This helps address the "InsufficientInstanceCapacity" error you're experiencing by ensuring GPU availability when you need it.

When configuring your endpoint with FTP, you can use the CapacityReservationPreference setting with the "capacity-reservations-only" option, which ensures SageMaker launches instances only into your ML capacity reservation.
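As a minimal sketch of how that might look with boto3, here is an endpoint-config request that pins a production variant to an ML capacity reservation. The model name, plan ARN, endpoint-config name, and instance type are placeholders, and the exact field names follow the `ProductionVariantCapacityReservationConfig` shape referenced in the sources; verify them against the current API reference before use.

```python
# Sketch (placeholder names): build a create_endpoint_config request whose
# production variant launches only into a reserved ML capacity block.
production_variant = {
    "VariantName": "AllTraffic",
    "ModelName": "my-model",              # placeholder model name
    "InstanceType": "ml.p4d.24xlarge",    # the reserved GPU instance type
    "InitialInstanceCount": 1,
    "CapacityReservationConfig": {
        # Launch only into the reservation -- never fall back to shared
        # on-demand pools that can return InsufficientInstanceCapacity.
        "CapacityReservationPreference": "capacity-reservations-only",
        # Placeholder ARN of the Flexible Training Plan reservation:
        "MlReservationArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:training-plan/my-plan"
        ),
    },
}

# With credentials configured, the actual call would be:
# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_endpoint_config(
#     EndpointConfigName="reserved-gpu-endpoint-config",
#     ProductionVariants=[production_variant],
# )
```

Once the endpoint config exists, `create_endpoint` uses it as usual; the reservation binding lives entirely in the variant configuration.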

  2. On-Demand Capacity Reservations (ODCRs): These allow you to reserve compute capacity for specific instance types in a specific Availability Zone. ODCRs provide uninterrupted access to accelerated instances (GPU, Trainium, or Inferentia) that you reserve, which is particularly valuable for ML workloads with strict capacity requirements.
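For the ODCR route, a sketch of the EC2 `CreateCapacityReservation` request might look like the following. The instance type, Availability Zone, and count are placeholders; note that EC2 reservations use the bare instance name (no `ml.` prefix) and are scoped to a single AZ.

```python
# Sketch (placeholder values): request an On-Demand Capacity Reservation
# for a GPU instance type in a specific Availability Zone.
odcr_request = {
    "InstanceType": "p4d.24xlarge",      # EC2 name, without the "ml." prefix
    "InstancePlatform": "Linux/UNIX",
    "AvailabilityZone": "us-east-1a",    # ODCRs are AZ-scoped
    "InstanceCount": 1,
    "EndDateType": "unlimited",          # hold capacity until cancelled
}

# With credentials configured, the actual call would be:
# import boto3
# ec2 = boto3.client("ec2")
# response = ec2.create_capacity_reservation(**odcr_request)
# print(response["CapacityReservation"]["CapacityReservationId"])
```

Keep in mind that an `unlimited` reservation bills for the reserved capacity whether or not instances are running in it, so cancel it when the demand window passes.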

By using either of these capacity reservation options, you can avoid the "InsufficientInstanceCapacity" error for your GPU-based inference endpoint, especially during times of high demand. The FTP approach is specifically designed for SageMaker inference workloads, while ODCRs are a more general EC2 construct that can also benefit SageMaker deployments.

FTP for inference endpoints is currently available in US East (N. Virginia), US West (Oregon), and US East (Ohio) regions.
Sources
Amazon SageMaker AI now supports Flexible Training Plans capacity for Inference - AWS
ProductionVariantCapacityReservationConfig - Amazon SageMaker
Optimizing cost for building AI models with Amazon EC2 and SageMaker AI | AWS Cloud Financial Management
Real-time inference - Amazon SageMaker AI

answered 4 days ago

