How to verify that checkpoints work for SageMaker Spot Training?

0

Hi,

How can we know that checkpoint works before launching a sagemaker spot training job? Is there a way to force a regular checkpoint to s3 instead of waiting for the SIGTERM?

cheers

AWS
專家
已提問 4 年前檢視次數 330 次
1 個回答
0
已接受的答案

Hi olivier, If you enable Sagemaker checkpointing , it periodically saves a copy of the artifacts into S3. I have used this in pytorch and it works by checkpointing periodically and the blog on Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs also mentions the same

To avoid restarting a training job from scratch should it be interrupted, we strongly recommend that you implement checkpointing, a technique that saves the model in training at periodic intervals

AWS
專家
已回答 4 年前

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南