How can I checkpoint a SageMaker model artifact during a training job?


Hi,

Is there a way to regularly checkpoint the model artifact in a SageMaker training job when using a BYO (bring your own) training container?

AWS
Expert
asked 4 years ago · 941 views
1 Answer
Accepted Answer

If you specify a checkpoint configuration when starting a training job (whether or not you use managed spot training), checkpointing will work. You provide a local path and an S3 path as follows (see the CheckpointConfig API reference):

"CheckpointConfig": { 
  "LocalPath": "string",
  "S3Uri": "string"
}

The local path defaults to /opt/ml/checkpoints/, and then you specify the target path in S3 with S3Uri.
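For reference, the same configuration can be expressed through the SageMaker Python SDK, which exposes these settings as the checkpoint_s3_uri and checkpoint_local_path parameters of an Estimator. This is a minimal sketch; the image URI, role, and bucket names are placeholders you would replace with your own:

# Minimal sketch: enabling checkpointing for a BYO container via the SDK.
# Image URI, role ARN, and bucket names below are hypothetical placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-byo-training-image>",       # your BYO training container
    role="<your-sagemaker-execution-role>",      # IAM role ARN for the job
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<your-bucket>/output/",
    # These two parameters populate CheckpointConfig for the training job.
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",  # optional; this is the default
)

estimator.fit()  # pass input channels here if your job needs them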

Given this configuration, SageMaker will configure an output channel with Continuous upload mode to Amazon S3. At the time of writing, this results in an agent running on the hosts that watches the file system and continuously uploads data to Amazon S3. A similar mechanism is used when debugging is enabled, for delivering tensor data to Amazon S3.
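On the training-script side inside your BYO container, all you need to do is write checkpoint files under the configured local path; the agent picks them up and uploads them. A hedged sketch, assuming a plain Python training loop (the file format and checkpoint contents are entirely up to your framework):

# Sketch of the in-container side: write checkpoints to the LocalPath so the
# SageMaker agent continuously uploads them to the configured S3Uri.
import json
import os
import time

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # must match CheckpointConfig LocalPath
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

for epoch in range(10):
    time.sleep(1)                 # placeholder for an actual training step
    state = {"epoch": epoch}      # placeholder for real model/optimizer state
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint-{epoch}.json")
    with open(path, "w") as f:
        json.dump(state, f)       # file appears locally, agent syncs it to S3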

As noted in the comments, sagemaker-containers implements its own code for saving intermediate outputs and watching files on the file system, but I would rather rely on the functionality offered by the service and avoid depending on specific libraries where possible.

Note: when using SageMaker Processing, which in my view can be considered an abstraction over training or, from another perspective, the foundation for training, you can configure an output channel to use continuous upload mode; further info here.
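For completeness, a hedged sketch of the Processing-job equivalent: ProcessingOutput accepts an s3_upload_mode of "Continuous", so files written to the local output path are uploaded to S3 as they appear. Image, role, and bucket names are placeholders:

# Sketch: continuous upload of a Processing job's output channel.
from sagemaker.processing import Processor, ProcessingOutput

processor = Processor(
    image_uri="<your-processing-image>",          # hypothetical container image
    role="<your-sagemaker-execution-role>",       # hypothetical IAM role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",   # local path your code writes to
            destination="s3://<your-bucket>/processing-output/",
            s3_upload_mode="Continuous",          # upload files as they are written
        )
    ]
)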

AWS
Expert
answered 4 years ago
