Distributed DL Training on Spot Instances with SageMaker
Hi,
The SageMaker documentation says that distributed (multi-node) deep learning training is supported [1]. It is also possible to use Spot instances with SageMaker [2]. Can these features be combined to run multi-node distributed training on Spot instances?
If so, what are the failure semantics when peers drop out before submitting their gradients? I could not find any documentation on this.
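Yes, you can run distributed training on Spot instances. In the SageMaker Python SDK, set use_spot_instances=True on the estimator, together with max_wait. Because Spot capacity can be reclaimed at any time, also save checkpoints periodically (roughly every hour) so that an interrupted job can resume instead of starting over: https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html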
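To make the checkpointing advice concrete, here is a minimal sketch of a save-and-resume loop, using only the Python standard library. It assumes checkpoints are written to /opt/ml/checkpoints, which SageMaker copies to the S3 location configured for checkpoints. The function names, the JSON file layout, and the toy "training" step are hypothetical and stand in for a real framework's checkpoint API.

```python
import json
import os
import tempfile

# Assumption: the local checkpoint directory that SageMaker syncs to S3.
CHECKPOINT_DIR = "/opt/ml/checkpoints"


def save_checkpoint(ckpt_dir, epoch, state):
    """Write one checkpoint per finished epoch, atomically."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"epoch-{epoch:05d}.json")
    # Write to a temporary file first so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp_path, final_path)


def load_latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint as a dict, or None if there is none."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("epoch-") and n.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)


def train(ckpt_dir, total_epochs, stop_after=None):
    """Toy training loop that resumes from the latest checkpoint.

    stop_after is a hypothetical knob that simulates a Spot interruption
    before the given epoch starts.
    """
    ckpt = load_latest_checkpoint(ckpt_dir)
    start_epoch = ckpt["epoch"] + 1 if ckpt else 0
    state = ckpt["state"] if ckpt else {"steps": 0}
    for epoch in range(start_epoch, total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulated interruption
        state["steps"] += 1  # placeholder for one epoch of real training
        save_checkpoint(ckpt_dir, epoch, state)
    return state
```

In a real job you would call train(CHECKPOINT_DIR, ...) from the training script and replace the JSON state with your framework's model and optimizer state. The key point is that the script checks for an existing checkpoint on startup, so a restarted job continues from the last saved epoch.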