How should a custom SageMaker algorithm determine if checkpoints are enabled?


Per the SageMaker Environment Variables doc, algorithms should save model artifacts to the folder prescribed by SM_MODEL_DIR.

The SageMaker Containers doc describes additional environment variables, including SM_OUTPUT_DATA_DIR to write non-model training artifacts.
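For context, here's roughly how a training script picks those two up at runtime (a minimal sketch; the fallback values are just the usual documented defaults):

    import os

    # Injected into the container environment by SageMaker when the training job runs
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    output_data_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data")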

...But how should the algorithm determine if checkpointing has been requested?

The Using Checkpoints in Amazon SageMaker doc only specifies a default local path to save them to, and I can't see any environment variable that would indicate whether or not to checkpoint. I've seen one piece of code checking for the existence of that default local path, but I'm not convinced anybody has actually verified that it works (i.e. that the path is present when checkpointing is requested and absent when it's not).
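The check I've seen looks roughly like this (a sketch only; as said, I haven't verified that the directory is actually absent when checkpointing isn't configured):

    import os

    CHECKPOINT_DIR = "/opt/ml/checkpoints"  # default local path from the docs

    # Heuristic: treat the presence of the directory as "checkpointing was requested"
    checkpointing_enabled = os.path.isdir(CHECKPOINT_DIR)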

It would be good to parameterize checkpointing so that jobs that don't need it don't waste EBS space (and precious seconds of I/O); and, given the conventions for other I/O like the model and data folders, I would have assumed SageMaker has a specific mechanism to pass this instruction, rather than just defining an algorithm hyperparameter?

AWS EXPERT
Alex_T
Asked 4 years ago · 402 views
1 Answer

Accepted Answer

Hi,

For custom SageMaker containers or deep learning framework containers, I tend to do the following, and it works.

This is an example for PyTorch that I have tried:

  • entry point file:

    # 1. Define a custom argument, say checkpointdir
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpointdir", help="The checkpoint dir", type=str,
                        default=None)
    # 2. You can add additional params for checkpoint frequency etc.
    args = parser.parse_args()

    # 3. Code for checkpointing
    if args.checkpointdir is not None:
        # TODO: save the model checkpoint here (see the save/resume sketch below)
        pass

  • SageMaker estimator in a Jupyter notebook, e.g.:

# 1. Define local and remote variables for checkpoints
checkpoint_s3 = "s3://{}/{}/".format(bucket, "checkpoints")
localcheckpoint_dir = "/opt/ml/checkpoints/"

hyperparameters = {
    "batchsize": "8",
    "epochs": "1000",
    "learning_rate": .0001,
    "weight_decay": 5e-5,
    "momentum": .9,
    "patience": 20,
    "log-level": "INFO",
    "commit_id": commit_id,
    "model": "FasterRcnnFactory",
    "accumulation_steps": 8,
    # 2. Define a hyperparameter for the checkpoint dir
    "checkpointdir": localcheckpoint_dir
}

# In the SageMaker estimator, specify the local and remote checkpoint paths
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='experiment_train.py',
    source_dir='src',
    dependencies=['src/datasets', 'src/evaluators', 'src/models'],
    role=role,
    framework_version="1.0.0",
    py_version='py3',
    git_config=git_config,
    image_name=docker_repo,
    train_instance_count=1,
    train_instance_type=instance_type,
    # 3. The entry point file will pick up the checkpoint location from here
    hyperparameters=hyperparameters,
    output_path=s3_output_path,
    metric_definitions=metric_definitions,
    train_use_spot_instances=use_spot,
    train_max_run=train_max_run_secs,
    train_max_wait=max_wait_time_secs,
    base_job_name="object-detection",
    # 4. SageMaker knows that checkpoints need to be periodically copied from
    #    localcheckpoint_dir to the S3 location pointed to by checkpoint_s3
    checkpoint_s3_uri=checkpoint_s3,
    checkpoint_local_path=localcheckpoint_dir)
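
To complete the picture, here's a rough sketch of what the entry point's save/resume logic could look like (my own file naming and layout, not anything SageMaker mandates). With spot training, SageMaker copies whatever is under checkpoint_s3_uri back into checkpoint_local_path when the job restarts, so the script can simply look for an existing checkpoint there:

    import glob
    import os
    import torch

    def save_checkpoint(model, optimizer, epoch, checkpointdir):
        """Write a checkpoint; SageMaker syncs this dir to checkpoint_s3_uri."""
        if checkpointdir is None:
            return  # checkpointing not requested
        os.makedirs(checkpointdir, exist_ok=True)
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            os.path.join(checkpointdir, "checkpoint_{:06d}.pt".format(epoch)),
        )

    def try_resume(model, optimizer, checkpointdir):
        """Return the epoch to start from, loading the latest checkpoint if present."""
        if checkpointdir is None:
            return 0
        candidates = sorted(glob.glob(os.path.join(checkpointdir, "*.pt")))
        if not candidates:
            return 0  # fresh start, nothing to resume from
        state = torch.load(candidates[-1], map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1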
AWS EXPERT
answered 4 years ago
