See the TLG Aerospace case study for a tightly coupled example: https://aws.amazon.com/solutions/case-studies/tlg-aerospace/
Fermilab for HTC: https://aws.amazon.com/blogs/aws/experiment-that-discovered-the-higgs-boson-uses-aws-to-probe-nature/
Spot is often used for both high-throughput computing (HTC) and tightly coupled HPC. This is because most HPC workloads are short-lived and already have checkpointing. Long-time HPC users are used to less reliable environments, or to forced checkpointing to get off a supercomputer within a certain number of hours.
A lot of HPC is run on Spot. An interrupted job is lost unless it can complete within the two-minute interruption notice, and many users are OK with that. Those who aren't can checkpoint every so often and relaunch from the EBS volume. Still, even though users could recover most jobs, they often don't bother to try to save a case. They check Spot prices beforehand, bid at a higher value, and rerun the job if the Spot price exceeds their bid. This is because relatively few jobs are lost when a bit of care is taken, the savings are already large, and relaunching a lost job is easy.
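For anyone who does want to save a case, here is a minimal sketch of the checkpoint-and-relaunch pattern described above. It is not taken from any particular solver: the function names, the JSON checkpoint format, and the "add 1.0 per step" stand-in for a solver iteration are all assumptions for illustration. The one AWS-specific piece is the instance metadata path `spot/instance-action`, which returns 404 until an interruption is scheduled. The checkpoint path would normally sit on an EBS volume that survives the instance.

```python
import json
import os
import tempfile
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def spot_interruption_pending(timeout=1.0):
    """Return True if EC2 has scheduled a Spot interruption for this instance."""
    try:
        # IMDSv2: fetch a short-lived session token first.
        token_req = urllib.request.Request(
            IMDS + "/api/token", method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
        token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
        req = urllib.request.Request(
            IMDS + "/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token})
        urllib.request.urlopen(req, timeout=timeout)
        return True   # 200: an interruption notice is present
    except (urllib.error.URLError, OSError):
        return False  # 404 (no notice yet) or not running on EC2


def save_checkpoint(path, state):
    """Write the state atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "value": 0.0}


def run(path, total_steps, every=100, interrupted=spot_interruption_pending):
    """Run (or resume) the job; returns (state, finished)."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["value"] += 1.0  # hypothetical stand-in for one solver iteration
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
            if interrupted():
                return state, False  # stop cleanly; relaunch resumes from here
    save_checkpoint(path, state)
    return state, True
```

Relaunching is then just calling `run` again with the same path on a new instance that has the volume attached. The `interrupted` parameter is injectable so the loop can be exercised off EC2; checking it only at checkpoint boundaries assumes a checkpoint interval well under two minutes.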
This blog post describes how to use Spot Instances with checkpointing for CAE workloads such as LS-DYNA: https://aws.amazon.com/blogs/hpc/cost-optimization-on-spot-instances-using-checkpoints-for-ansys-ls-dyna/