Spot Instances for HPC workloads: use cases


Has anyone used Spot Instances for HPC workloads? Do we have any public reference use cases?

AFAIK, Spot can be used if the job can checkpoint its state (so it can be restarted after a termination). Otherwise, if the job runs for many hours, you run the risk of having it terminated before it completes.

Looking forward to your comments.

I'd appreciate it if you could share links to public use cases or guides.

AWS
Asked 7 years ago · 467 views
2 answers

See the TLG Aerospace case study for a tightly coupled workload: https://aws.amazon.com/solutions/case-studies/tlg-aerospace/

Fermilab for HTC (high-throughput computing): https://aws.amazon.com/blogs/aws/experiment-that-discovered-the-higgs-boson-uses-aws-to-probe-nature/

Spot is often used for both high-throughput computing (HTC) and tightly coupled HPC. This is because most HPC workloads are short-lived and already have checkpointing. Long-time HPC users are used to less reliable environments, or to forced checkpointing to get off a supercomputer within a certain number of hours.

A lot of HPC runs on Spot. Spot gives a two-minute interruption notice, and unless a job can finish within that window, many users are OK with simply losing it. Those who aren't can checkpoint periodically and relaunch from the EBS volume. Even though most jobs could be recovered, users often don't bother trying to save a case: they check Spot prices beforehand, bid above the current price, and rerun the job if the Spot price exceeds their bid. With a bit of care relatively few jobs are lost, the savings are already large, and relaunching a lost job is easy.
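For those who do want to recover a job, the checkpoint-and-relaunch pattern can be sketched roughly as below. This is only a minimal illustration, not how any particular solver does it: `save_checkpoint`, `load_checkpoint`, `run_job`, the JSON state layout, and the `interrupted` callback are all hypothetical names made up for this example. In practice the checkpoint path would sit on a volume that survives the instance (such as the EBS volume mentioned above), and `interrupted` would be whatever check you use to detect the Spot interruption notice.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it into place, so an
    interruption in the middle of a write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic when tmp and path share a filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run_job(path, n_steps, every=10, interrupted=lambda: False):
    """Run (or resume) a toy job. Returns True if the job finished,
    False if it stopped early because interrupted() returned True."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if interrupted():
            # Hypothetical hook: save what we have before the instance goes away.
            save_checkpoint(path, state)
            return False
        state["total"] += state["step"]  # stand-in for one solver iteration
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)  # periodic checkpoint
    save_checkpoint(path, state)
    return True
```

Relaunching is then just calling `run_job` again with the same path on a new instance: it picks up from the last saved step rather than from zero. How often to checkpoint (`every`) is a trade-off between I/O overhead and how much work you are willing to redo.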

Moderator
Answered 7 years ago

Accepted answer

This blog post describes how to use Spot Instances with checkpointing for CAE workloads such as LS-DYNA: https://aws.amazon.com/blogs/hpc/cost-optimization-on-spot-instances-using-checkpoints-for-ansys-ls-dyna/

AWS
Answered a year ago
