How to checkpoint SageMaker model artifacts during a training job?


Hi,

Is there a way to regularly checkpoint model artifacts during a SageMaker training job when using a BYO (bring-your-own) training container?

AWS
EXPERT
Asked 4 years ago · 972 views
1 Answer
Accepted Answer

If you specify a checkpoint configuration when starting a training job (this works regardless of whether managed spot training is enabled), checkpointing will work for a BYO container as well. You provide a local path and an S3 path as follows (see the CreateTrainingJob API reference):

"CheckpointConfig": { 
  "LocalPath": "string",
  "S3Uri": "string"
}

The local path defaults to /opt/ml/checkpoints/; your training code writes checkpoint files to that directory, and S3Uri specifies the target location in S3 they are uploaded to.
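
For example, with the SageMaker Python SDK the same configuration can be passed when constructing an estimator for a BYO container. A minimal sketch, assuming SDK v2; the image URI, role, and bucket names are placeholders:

import sagemaker
from sagemaker.estimator import Estimator

# Placeholder image, role, and S3 locations -- replace with your own.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-byo-training:latest",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # CheckpointConfig: files written to the local path are synced to the S3 URI.
    checkpoint_s3_uri="s3://my-bucket/my-training-job/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints/",
)

estimator.fit({"training": "s3://my-bucket/my-training-job/input/"})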

Given this configuration, SageMaker will configure an output channel with continuous upload mode to Amazon S3. At the time of writing, this results in an agent running on the training hosts that watches the local file system and continuously uploads new or changed files to Amazon S3. Similar behavior applies when debugging is enabled, for delivering tensor data to Amazon S3.
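
In practice this means the training code inside the BYO container only has to write checkpoint files to the local path; the agent handles the upload. A minimal sketch of such a training loop (the file format and checkpoint interval are illustrative, not part of the service contract):

import json
import os

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # must match CheckpointConfig LocalPath
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(epoch, state):
    # Anything written under CHECKPOINT_DIR is picked up and uploaded to S3Uri.
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint-epoch-{epoch:04d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)

for epoch in range(10):
    # ... run one epoch of training here ...
    save_checkpoint(epoch, {"loss": round(1.0 / (epoch + 1), 4)})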

As commented, sagemaker-containers implements its own code for saving intermediate outputs and watching files on the file system, but I would rather rely on the functionality offered by the service to avoid dependencies on specific libraries where possible.

Note: with SageMaker Processing, which in my view can be considered either an abstraction over training or, from another perspective, the foundation for training, you can also configure an output channel to use continuous upload mode; further info here. A sketch of that configuration follows.
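
For reference, a sketch of a Processing output configured for continuous upload with the SageMaker Python SDK; the image URI, role ARN, and S3 destination are placeholders:

from sagemaker.processing import Processor, ProcessingOutput

processor = Processor(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-processing:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/processing-output/",
            s3_upload_mode="Continuous",  # upload files as they are written
        )
    ]
)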

AWS
EXPERT
Answered 4 years ago
