There's no built-in facility in Kubernetes to suspend/resume pods. Customers who want to run long-running tasks on Spot Instances should make sure their applications checkpoint their progress to persistent storage such as EBS, EFS, or S3. When the pod is rescheduled, it can restore its checkpointed state and resume processing from where it left off.
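As a rough illustration of the checkpoint-and-resume pattern, here is a minimal Python sketch. It assumes the checkpoint lives at a file path on a persistent volume (for example an EFS or EBS mount); the function names, the JSON state layout, and the `stop_after` parameter (which just simulates an interruption) are all hypothetical and not part of any AWS or Kubernetes API. An S3-backed version would swap the file I/O for object reads and writes.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON via a temp file and an atomic rename, so an
    interruption mid-write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def process(items, path, stop_after=None):
    """Sum the items, checkpointing after each one.

    stop_after simulates an interruption after that many items.
    Returns (state, finished).
    """
    state = load_checkpoint(path)
    processed = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and processed >= stop_after:
            return state, False
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(path, state)
        processed += 1
    return state, True
```

Calling `process` again with the same checkpoint path after an interruption picks up at `next_item` instead of starting over. How often to checkpoint is a trade-off between write overhead and the amount of work redone after an interruption.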
Building with resiliency in mind is a good idea even for non-Spot workloads, since hardware can and does fail.
Lastly, it is worth mentioning that by following Spot best practices, customers can configure their node groups to minimize the number of Spot interruptions they experience. This can be done by adding multiple instance types and using the capacity-optimized allocation strategy in their EC2 Auto Scaling groups. If they are using EKS managed node groups with Spot Instances, this is already configured out of the box.
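For self-managed node groups, the diversification described above might look something like the following sketch. It only builds the mixed instances policy as a plain dictionary; the helper name, the launch template name, and the instance types in the usage note are illustrative assumptions, and the resulting structure is intended to match the `MixedInstancesPolicy` parameter of the EC2 Auto Scaling API.

```python
def build_mixed_instances_policy(launch_template_name, instance_types):
    """Build a MixedInstancesPolicy dict that runs only Spot capacity,
    spread across several instance types, with the capacity-optimized
    allocation strategy. (Hypothetical helper, not an AWS SDK function.)"""
    if len(instance_types) < 2:
        raise ValueError("provide at least two instance types to diversify")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # 0% On-Demand above the base capacity means the rest is Spot.
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }
```

For example, `build_mixed_instances_policy("my-node-template", ["m5.large", "m5a.large", "m4.large"])` returns a dictionary that could be passed as `MixedInstancesPolicy` when creating or updating an Auto Scaling group with boto3. Choosing instance types with similar vCPU and memory sizes keeps pod scheduling predictable across the group.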