Distributed DL Training on Spot Instances with SageMaker


Hi,

The SageMaker documentation suggests that multi-node distributed deep learning training is supported [1]. It is also possible to use Spot instances with SageMaker [2]. Is it possible to combine these features and run multi-node distributed training on Spot instances? If so, what are the failure semantics when a peer drops out before submitting its gradients? I could not find any documentation on this.

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html
[2] https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html

Alex
Asked 10 months ago, 142 views
No answers
