How do I checkpoint model artifacts during a SageMaker training job?


Hi,

Is there a way to regularly checkpoint model artifacts in a SageMaker training job when using a bring-your-own (BYO) training container?

AWS
EXPERT
asked 4 years ago · 970 views
1 answer
Accepted Answer

If you specify a checkpoint configuration when starting a training job (regardless of whether managed spot training is enabled), checkpointing will work. You provide a local path and an S3 path as follows (see the API reference):

"CheckpointConfig": { 
  "LocalPath": "string",
  "S3Uri": "string"
}

The local path defaults to /opt/ml/checkpoints/, and S3Uri specifies the target path in Amazon S3.
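For illustration, here is a minimal sketch of how that CheckpointConfig block might be assembled for a boto3 create_training_job request; the bucket name and prefix are hypothetical placeholders, not values from the question:

```python
# Sketch: the CheckpointConfig portion of a boto3
# sagemaker.create_training_job(...) request.
# Bucket name and prefix below are hypothetical placeholders.

checkpoint_config = {
    # Path inside the training container that is watched for new files;
    # /opt/ml/checkpoints/ is the default and can be omitted.
    "LocalPath": "/opt/ml/checkpoints/",
    # Target S3 location for the continuously uploaded checkpoint files.
    "S3Uri": "s3://my-example-bucket/my-training-job/checkpoints/",
}

# This dict would be passed alongside the rest of the request, e.g.:
# sagemaker_client.create_training_job(..., CheckpointConfig=checkpoint_config)
print(checkpoint_config["S3Uri"])
```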

Given this configuration, SageMaker configures an output channel in continuous upload mode to Amazon S3. At the time of writing, this means an agent runs on the training hosts, watches the file system, and continuously uploads new data to Amazon S3. Similar behavior applies when debugging is enabled, to deliver tensor data to Amazon S3.
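On the container side, the training code only needs to write checkpoint files under the configured local path; the upload agent takes care of S3. A minimal, hypothetical sketch of such a BYO training loop (using a temporary directory in place of /opt/ml/checkpoints/ so it can run anywhere):

```python
import json
import os
import tempfile

def save_checkpoint(checkpoint_dir, epoch, state):
    """Write a checkpoint file. Inside a SageMaker training job,
    checkpoint_dir would be /opt/ml/checkpoints/ and anything
    written here would be continuously uploaded to S3Uri."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint-{epoch:04d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path

def load_latest_checkpoint(checkpoint_dir):
    """On (re)start, resume from the newest checkpoint if one exists,
    e.g. after a spot interruption when the files are restored."""
    if not os.path.isdir(checkpoint_dir):
        return None
    files = sorted(f for f in os.listdir(checkpoint_dir)
                   if f.startswith("checkpoint-"))
    if not files:
        return None
    with open(os.path.join(checkpoint_dir, files[-1])) as f:
        return json.load(f)

# Demo with a temp dir standing in for /opt/ml/checkpoints/:
ckpt_dir = tempfile.mkdtemp()
for epoch in range(3):
    save_checkpoint(ckpt_dir, epoch, state={"loss": 1.0 / (epoch + 1)})
latest = load_latest_checkpoint(ckpt_dir)
print(latest["epoch"])  # → 2
```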

As noted in the comments, sagemaker-containers implements its own code to save intermediate outputs and watch files on the file system, but I would rather rely on the functionality offered by the service, to avoid dependencies on specific libraries where possible.

Note: with SageMaker Processing, which in my view can be considered either an abstraction over training or, from another perspective, the foundation for it, you can also configure an output channel to use continuous upload mode; further info here.
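As a hedged illustration, the low-level equivalent for Processing sets S3UploadMode on an output in the ProcessingOutputConfig of a create_processing_job request; the bucket, prefix, and output name below are hypothetical placeholders:

```python
# Sketch: a ProcessingOutputConfig entry for boto3
# sagemaker.create_processing_job(...) with continuous upload.
# Bucket/prefix and output name are hypothetical placeholders.

processing_output_config = {
    "Outputs": [
        {
            "OutputName": "intermediate",
            "S3Output": {
                "S3Uri": "s3://my-example-bucket/processing/output/",
                "LocalPath": "/opt/ml/processing/output/",
                # "Continuous" uploads files as they are written,
                # instead of the default "EndOfJob".
                "S3UploadMode": "Continuous",
            },
        }
    ]
}
print(processing_output_config["Outputs"][0]["S3Output"]["S3UploadMode"])
```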

AWS
EXPERT
answered 4 years ago
