How to checkpoint SageMaker model artifacts during a training job?


Hi,

Is there a way to regularly checkpoint model artifacts in a SageMaker training job when using a BYO training container?

AWS
EXPERT
Asked 4 years ago · 979 views
1 Answer

Accepted Answer

If you specify a checkpoint configuration when starting a training job (regardless of whether managed spot training is used), checkpointing will work. You provide a local path and an S3 path as follows (see the CreateTrainingJob API reference):

"CheckpointConfig": { 
  "LocalPath": "string",
  "S3Uri": "string"
}

The local path defaults to /opt/ml/checkpoints/, and then you specify the target path in S3 with S3Uri.
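
For example, with the SageMaker Python SDK you can set this configuration through the checkpoint_s3_uri and checkpoint_local_path parameters of the Estimator. A minimal sketch, assuming a BYO training image; the image URI, role, and bucket are placeholders:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-byo-training-image>",       # placeholder
    role="<your-sagemaker-execution-role-arn>",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Maps to CheckpointConfig: files written to the local path are
    # synced to the S3 URI while the job runs
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",  # placeholder
    checkpoint_local_path="/opt/ml/checkpoints",          # the default
)

estimator.fit()

If you are not using the SDK, the same configuration can be passed directly as CheckpointConfig in a CreateTrainingJob call.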

Given this configuration, SageMaker configures an output channel in continuous upload mode to Amazon S3. At the time of writing, this means an agent runs on the training hosts, watches the file system at the local path, and continuously uploads new or changed files to Amazon S3. Similar behavior applies when debugging is enabled, for delivering tensor data to Amazon S3.
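
Inside the BYO container, your training code only needs to write checkpoint files under the local path; the agent takes care of the upload. A minimal sketch; the loop and file format are illustrative stand-ins for your framework's native save function (e.g. torch.save):

import json
import os

# Default LocalPath from CheckpointConfig; files written here are
# continuously uploaded to the configured S3Uri by the agent.
CHECKPOINT_DIR = "/opt/ml/checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(epoch, state):
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint-{epoch}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)

for epoch in range(10):
    state = {"loss": 1.0 / (epoch + 1)}  # stand-in for real training state
    save_checkpoint(epoch, state)

Note that when a job restarts (for example after a spot interruption), SageMaker copies existing checkpoints from the S3Uri back into the local path before your code starts, so the training script can check that directory and resume from the latest checkpoint.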

As noted in the comments, sagemaker-containers implements its own code for saving intermediate outputs and watching files on the file system, but I would rather rely on the functionality offered by the service, to avoid dependencies on specific libraries where possible.

Note: when using SageMaker Processing, which in my view can be seen either as an abstraction over training or, from another angle, as its foundation, you can also configure an output channel to use continuous upload mode; see the SageMaker Processing documentation for details.
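
In the SageMaker Python SDK this corresponds to the s3_upload_mode parameter of ProcessingOutput. A minimal sketch; the output name and destination bucket are placeholders:

from sagemaker.processing import ProcessingOutput

output = ProcessingOutput(
    output_name="intermediate",                        # placeholder
    source="/opt/ml/processing/output",                # local path in the container
    destination="s3://<your-bucket>/processing-out/",  # placeholder
    s3_upload_mode="Continuous",                       # default is "EndOfJob"
)

The output is then passed to the processor via the outputs argument of its run() method.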

AWS
EXPERT
Answered 4 years ago
