1 Answer
Hi,
For custom SageMaker containers or deep learning frameworks, I tend to do the following, and it works. This example is for PyTorch, which I have tried.
- Entry point file:
```python
import argparse

# 1. Define a custom argument, say checkpointdir
parser = argparse.ArgumentParser()
parser.add_argument("--checkpointdir", help="The checkpoint dir", type=str, default=None)
# 2. You can add additional params for checkpoint frequency etc.
args = parser.parse_args()

# 3. Code for checkpointing
if args.checkpointdir is not None:
    pass  # TODO: save model (see the sketch after this list)
```
- SageMaker estimator in a Jupyter notebook, e.g.:
```python
from sagemaker.pytorch import PyTorch

# 1. Define local and remote variables for checkpoints
checkpoint_s3 = "s3://{}/{}/".format(bucket, "checkpoints")
localcheckpoint_dir = "/opt/ml/checkpoints/"

hyperparameters = {
    "batchsize": "8",
    "epochs": "1000",
    "learning_rate": .0001,
    "weight_decay": 5e-5,
    "momentum": .9,
    "patience": 20,
    "log-level": "INFO",
    "commit_id": commit_id,
    "model": "FasterRcnnFactory",
    "accumulation_steps": 8,
    # 2. Define a hyperparameter for the checkpoint dir
    "checkpointdir": localcheckpoint_dir,
}

# In the SageMaker estimator, specify the local and remote paths
estimator = PyTorch(
    entry_point='experiment_train.py',
    source_dir='src',
    dependencies=['src/datasets', 'src/evaluators', 'src/models'],
    role=role,
    framework_version="1.0.0",
    py_version='py3',
    git_config=git_config,
    image_name=docker_repo,
    train_instance_count=1,
    train_instance_type=instance_type,
    # 3. The entry point file will pick up the checkpoint location from here
    hyperparameters=hyperparameters,
    output_path=s3_output_path,
    metric_definitions=metric_definitions,
    train_use_spot_instances=use_spot,
    train_max_run=train_max_run_secs,
    train_max_wait=max_wait_time_secs,
    base_job_name="object-detection",
    # 4. SageMaker knows that the checkpoints will need to be periodically
    #    copied from localcheckpoint_dir to the S3 URI in checkpoint_s3
    checkpoint_s3_uri=checkpoint_s3,
    checkpoint_local_path=localcheckpoint_dir)
```
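To fill in the TODO in the entry point file above, here is a minimal sketch of a save step, assuming a standard PyTorch `model` and `optimizer`. The `save_checkpoint` helper, the `checkpoint.pt` file name, and the `epoch` variable are illustrative, not part of the original answer:

```python
import os
import torch

def save_checkpoint(checkpointdir, model, optimizer, epoch):
    # Persist model/optimizer state so training can resume after an interruption
    os.makedirs(checkpointdir, exist_ok=True)
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, os.path.join(checkpointdir, "checkpoint.pt"))
```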
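When a spot training job restarts, SageMaker copies the contents of checkpoint_s3_uri back into checkpoint_local_path before the entry point runs, so the script can detect and resume from an existing checkpoint. A sketch of that resume logic, mirroring the hypothetical save_checkpoint helper above:

```python
import os
import torch

def maybe_resume(checkpointdir, model, optimizer):
    # If SageMaker restored a checkpoint from S3, pick up where training left off
    path = os.path.join(checkpointdir, "checkpoint.pt") if checkpointdir else None
    if path is None or not os.path.exists(path):
        return 0  # no checkpoint found: start from epoch 0
    state = torch.load(path)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1  # resume from the next epoch
```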