Hibernating Spot Instances upon interruption in Amazon EKS


A SaaS provider offers a service that lets customers launch long-running jobs. These jobs run in containers that are deployed to EC2 Spot Instances using EKS. If a Spot Instance is interrupted, the provider does not want to terminate the jobs and restart them from scratch. Instead, the provider wants each job's state persisted so that the job resumes where it left off on the new Spot Instance, using hibernation.

Do we have any working examples of this? The challenges I envision relate to configuring K8s not to replace the hibernated Spot Instance or its containers, since we want to resume them instead. We would also need to handle any side effects of having K8s processes running on the resumed EC2 instance. I suspect an external system would then need to watch for job completion and terminate the node accordingly. K8s would thus only be concerned with creating new Spot nodes and placing new jobs, without recovering or replacing failed ones.

AWS
Jay_M
asked 4 years ago · 729 views
1 Answer
Accepted Answer

There's no built-in facility in Kubernetes to suspend and resume pods. Customers who want to run long-running tasks on Spot Instances should ensure that their applications checkpoint their state to persistent storage such as EBS, EFS, or S3. When the pod is rescheduled, it can restore its checkpointed state and resume processing.
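
As an illustration of the checkpoint-and-resume pattern described above, here is a minimal Python sketch. It is not taken from any AWS sample. The names `CHECKPOINT_PATH`, `load_checkpoint`, `save_checkpoint`, and `run_job` are hypothetical, the "job" is just a running sum over a list, and the checkpoint is assumed to live on a POSIX filesystem (for example a mounted EBS or EFS volume). The `stop_after` parameter exists only to simulate an interruption.

```python
import json
import os
import tempfile

# Hypothetical location; in a pod this would point at a mounted persistent volume.
CHECKPOINT_PATH = os.environ.get("CHECKPOINT_PATH", "/tmp/job-checkpoint.json")


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a partial file is never left behind."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run_job(path, items, checkpoint_every=10, stop_after=None):
    """Sum the items, resuming from the last checkpoint.

    stop_after simulates an interruption after that many items in this run;
    anything processed since the last checkpoint is lost and redone on resume.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption, no final save
        state["total"] += items[i]
        state["next_item"] = i + 1
        done_this_run += 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run_job` a second time with the same path picks up at the last saved `next_item` instead of starting over, which is the behavior a rescheduled pod would rely on. How often to checkpoint is a trade-off between I/O cost and the amount of work redone after an interruption.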

Building with resiliency in mind is a good idea even for non-Spot workloads, since hardware can and does fail.

Lastly, it is worth mentioning that by following Spot best practices, customers can configure their node groups in a way that minimizes the number of Spot interruptions they experience. This can be done by adding multiple instance types and using the capacity-optimized allocation strategy in their EC2 Auto Scaling groups. If they are using EKS managed node groups with Spot Instances, this is already configured out of the box.
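
For self-managed node groups, the configuration described above could look roughly like the following. This is a sketch only: the helper `spot_mixed_instances_policy`, the launch template name, and the instance types in the usage note are illustrative assumptions. The helper only builds the `MixedInstancesPolicy` structure that the EC2 Auto Scaling API accepts and makes no AWS calls itself.

```python
def spot_mixed_instances_policy(launch_template_name, instance_types):
    """Build a MixedInstancesPolicy that runs only Spot capacity, spread over
    several instance types, with the capacity-optimized allocation strategy."""
    if len(instance_types) < 2:
        # Diversifying across instance types is the point of this policy.
        raise ValueError("provide at least two instance types")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # No On-Demand capacity: every instance in the group is Spot.
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }
```

The resulting dictionary, for example from `spot_mixed_instances_policy("my-node-template", ["m5.large", "m5a.large", "m4.large"])`, could be passed as the `MixedInstancesPolicy` argument of boto3's `create_auto_scaling_group`, along with the other parameters that call requires.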

AWS
Expert
answered 4 years ago
Expert
reviewed a month ago
