Distributed DL Training on Spot Instances with SageMaker

Hi,

The SageMaker documentation suggests that multi-node distributed deep learning training is supported [1]. It is also possible to use Spot instances with SageMaker [2]. Is it possible to combine these features and run multi-node distributed training on Spot instances? If so, what are the failure semantics when a peer drops out before submitting its gradients? I could not find any documentation on this.

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html
[2] https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html

Alex
Asked 10 months ago · 142 views
No answers
