SageMaker Training Job Error - "Checkpoint hyperparameters are missing. Please check the checkpoint hyperparameters file exists on S3., exit code: 2"

Hi,

I am using SageMaker for a computer vision project. The goal is to train an Object Detection model on SageMaker and create an Endpoint. We followed the AWS instructions to prepare a dataset consisting of image files and a *.manifest file, stored in a new S3 bucket in the same region as the SageMaker notebook.

We used the notebook (http://aws-tc-largeobjects.s3-us-west-2.amazonaws.com/DIG-TF-200-MLBEES-10-EN/demo.ipynb), which we downloaded from a link provided in an AWS YouTube video (https://www.youtube.com/watch?v=OFlu6Gd7CrQ).

We followed the notebook's instructions to load the images and the *.manifest file, ran the code, and then created a Training job. The job failed many times with the following error:

"Failure reason ClientError: Cannot resume training. Checkpoint hyperparameters are missing. Please check the checkpoint hyperparameters file exists on S3., exit code: 2"

The instance type used is p2.xlarge.
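As a first diagnostic step, it may help to look at how the failed job was actually configured. The sketch below is only an illustration: `summarize_training_job` is a hypothetical helper, and it assumes (the error message does not confirm this) that a checkpoint location or an extra input channel on the job could be related to the "Cannot resume training" message. It only reads fields from a DescribeTrainingJob response passed in as a dictionary.

```python
def summarize_training_job(description):
    """Pick out the DescribeTrainingJob fields that relate to resuming a job.

    `description` is the dictionary returned by the SageMaker
    DescribeTrainingJob API. This helper name is hypothetical.
    """
    # Names of the input channels the job was created with (e.g. "train").
    channels = [c["ChannelName"] for c in description.get("InputDataConfig", [])]
    # CheckpointConfig is optional; it is absent when no checkpoint location was set.
    checkpoint_uri = description.get("CheckpointConfig", {}).get("S3Uri")
    return {
        "status": description.get("TrainingJobStatus"),
        "failure_reason": description.get("FailureReason"),
        "channels": channels,
        "checkpoint_s3_uri": checkpoint_uri,
    }
```

With boto3 installed and credentials configured, the dictionary could come from `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`, using the real name of the failed job.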

I have no idea what this error means or what a checkpoint hyperparameters file is. I checked my S3 bucket, and no hyperparameters file exists there.
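For anyone who wants to repeat the S3 check programmatically rather than through the console, here is a minimal sketch. `list_keys` and `keys_containing` are hypothetical helper names, the bucket and prefix are placeholders, and searching for the word "hyperparameters" in key names is an assumption about how such a file might be named, since the error message does not give a path.

```python
def list_keys(s3_client, bucket, prefix):
    """Return every object key under `prefix` in `bucket`.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    keys = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def keys_containing(keys, needle):
    """Filter keys whose name contains `needle`, ignoring case."""
    return [k for k in keys if needle.lower() in k.lower()]
```

A call might look like `keys_containing(list_keys(boto3.client("s3"), "my-bucket", "my-prefix/"), "hyperparameters")`, where `my-bucket` and `my-prefix/` stand in for the real locations.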

I checked that all hyperparameters were set correctly during job creation. Here is the list as shown in the job report:

Hyperparameters

Key                          Value
base_network                 resnet-50
early_stopping               false
early_stopping_min_epochs    10
early_stopping_patience      5
early_stopping_tolerance     0.0
epochs                       30
freeze_layer_pattern         false
image_shape                  300
label_width                  350
learning_rate                0.001
lr_scheduler_factor          0.1
mini_batch_size              1
momentum                     0.9
nms_threshold                0.45
num_classes                  1
num_training_samples         400
optimizer                    adam
overlap_threshold            0.5
use_pretrained_model         1
weight_decay                 0.0005
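For reference, the same settings can be written as the string-to-string map that the CreateTrainingJob API takes for `HyperParameters`. This is only a restatement of the values reported above, not a corrected configuration; `non_string_entries` is a hypothetical helper that flags values left as numbers or booleans.

```python
# Values copied verbatim from the job report, all kept as strings.
hyperparameters = {
    "base_network": "resnet-50",
    "early_stopping": "false",
    "early_stopping_min_epochs": "10",
    "early_stopping_patience": "5",
    "early_stopping_tolerance": "0.0",
    "epochs": "30",
    "freeze_layer_pattern": "false",
    "image_shape": "300",
    "label_width": "350",
    "learning_rate": "0.001",
    "lr_scheduler_factor": "0.1",
    "mini_batch_size": "1",
    "momentum": "0.9",
    "nms_threshold": "0.45",
    "num_classes": "1",
    "num_training_samples": "400",
    "optimizer": "adam",
    "overlap_threshold": "0.5",
    "use_pretrained_model": "1",
    "weight_decay": "0.0005",
}


def non_string_entries(params):
    """Return the keys whose values are not strings."""
    return [key for key, value in params.items() if not isinstance(value, str)]
```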

Thanks for your help!

Asked 1 year ago · Viewed 179 times
No answers
