How can I checkpoint SageMaker model artifacts during a training job?


Hi,

Is there a way to regularly checkpoint model artifacts during a SageMaker training job when using a BYO (bring-your-own) training container?

AWS
EXPERT
Asked 4 years ago · Viewed 978 times
1 Answer
Accepted Answer

If you specify a checkpoint configuration when starting a training job (regardless of whether managed spot training is used), checkpointing will work. You provide a local path and an S3 path as follows (API reference):

"CheckpointConfig": { 
  "LocalPath": "string",
  "S3Uri": "string"
}

The local path defaults to /opt/ml/checkpoints/, and then you specify the target path in S3 with S3Uri.
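
For reference, in the SageMaker Python SDK this maps to the checkpoint_s3_uri and checkpoint_local_path arguments of the Estimator, which populate CheckpointConfig on the underlying CreateTrainingJob call. A minimal sketch, where the image URI, role, and S3 paths are placeholders for your own values:

from sagemaker.estimator import Estimator

# Sketch only: the image URI, role, and S3 paths below are placeholders.
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-byo-training:latest",
    role="arn:aws:iam::<account>:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # These two arguments become CheckpointConfig.S3Uri and CheckpointConfig.LocalPath.
    checkpoint_s3_uri="s3://my-bucket/my-job/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints/",  # also the default local path
)

estimator.fit({"training": "s3://my-bucket/my-job/input/"})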

Given this configuration, SageMaker will configure an output channel with Continuous upload mode to Amazon S3. At the time of writing, this results in an agent running on the hosts that watches the file system and continuously uploads data to Amazon S3. Similar behavior applies when debugging is enabled, for delivering tensor data to Amazon S3.
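
From inside a BYO container, your training code only needs to write checkpoint files to the configured local path; the agent uploads them, and when a job restarts (for example after a spot interruption) the checkpoints already in S3Uri are downloaded back to the same local path before your script runs, so the script should check for an existing checkpoint and resume from it. A framework-agnostic sketch of that pattern (the file format and training loop are purely illustrative):

import os
import json

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # must match CheckpointConfig.LocalPath

def save_checkpoint(epoch, state):
    # Write a checkpoint file; the SageMaker agent uploads it to S3Uri continuously.
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, "checkpoint-{:04d}.json".format(epoch))
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)

def load_latest_checkpoint():
    # On (re)start, resume from the newest checkpoint restored from S3, if any.
    if not os.path.isdir(CHECKPOINT_DIR):
        return None
    files = sorted(f for f in os.listdir(CHECKPOINT_DIR) if f.startswith("checkpoint-"))
    if not files:
        return None
    with open(os.path.join(CHECKPOINT_DIR, files[-1])) as f:
        return json.load(f)

latest = load_latest_checkpoint()
start_epoch = latest["epoch"] + 1 if latest else 0

for epoch in range(start_epoch, 10):
    state = {"note": "placeholder for real model state"}  # illustrative only
    save_checkpoint(epoch, state)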

As commented, sagemaker-containers implements its own code for saving intermediate outputs and watching files on the file system, but I would rather rely on the functionality offered by the service to avoid dependencies on specific libraries where possible.

Note: when using SageMaker Processing, which in my view can be considered an abstraction over training or, from another perspective, the foundation for training, you can configure an output channel to use continuous upload mode; further info here.
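
For completeness, in the SageMaker Python SDK that Processing option corresponds to the s3_upload_mode argument of ProcessingOutput. A sketch, again with placeholder image, role, and paths:

from sagemaker.processing import Processor, ProcessingOutput

# Sketch only: image URI, role, and S3 destination are placeholders.
processor = Processor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-processing:latest",
    role="arn:aws:iam::<account>:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/checkpoints",   # local path your code writes to
            destination="s3://my-bucket/processing/checkpoints/",
            s3_upload_mode="Continuous",                # upload as files are written
        )
    ],
)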

AWS
EXPERT
Answered 4 years ago
