Downloading SageMaker training image takes 1 hour

For the last few days my training jobs have been taking far longer than usual, and the logs show over 1 hour spent downloading the training image. I'm using spot instances for training. Is this a symptom of that? It seems unlikely: I'd assumed that if a spot instance wasn't available I'd get some other error, or at least the job wouldn't have started preparing the instances. I'm using the HuggingFace estimator with the settings below, followed by the job log:

    transformers_version="4.28",  # Transformers version
    pytorch_version="2.0",  # PyTorch version
    py_version="py310",  # Python version

16:10:41  2023-10-10 05:10:10 Starting - Starting the training job...
16:11:41  2023-10-10 05:10:29 Starting - Preparing the instances for training......
16:12:11  2023-10-10 05:11:26 Downloading - Downloading input data...
16:19:14  2023-10-10 05:11:47 Training - Downloading the training image..........................................
17:27:40  2023-10-10 05:18:44 Training - Training image download completed. Training in progress.........................................................................................................................................................................................................................................................................................................................................................................................................................
17:28:42  2023-10-10 06:27:30 Uploading - Uploading generated training model......
17:28:42  2023-10-10 06:28:16 Completed - Training job completed
Dave
asked 7 months ago, 296 views
1 Answer

Actually, now that I look closer, I think the logs are "wrong": going by when the lines were printed, the total time spent "training" was only 1 minute, yet the job normally takes 1 hour to train without checkpoints, or a minimum of 10 minutes if I kick off the same job using checkpoints. So perhaps the log isn't being flushed correctly or something.
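
One way to check this is to compute the phase durations from the second timestamp on each log line instead of the first. The snippet below is only a rough sketch, and it assumes the second timestamp is when the job actually changed status while the first is merely when the client printed the line. The names `LOG`, `LINE_RE` and `phase_durations` are made up for this illustration, and the runs of progress dots are shortened.

```python
import re
from datetime import datetime

# Log lines copied from the question; trailing progress dots shortened.
LOG = """\
16:10:41  2023-10-10 05:10:10 Starting - Starting the training job...
16:11:41  2023-10-10 05:10:29 Starting - Preparing the instances for training......
16:12:11  2023-10-10 05:11:26 Downloading - Downloading input data...
16:19:14  2023-10-10 05:11:47 Training - Downloading the training image...
17:27:40  2023-10-10 05:18:44 Training - Training image download completed. Training in progress...
17:28:42  2023-10-10 06:27:30 Uploading - Uploading generated training model......
17:28:42  2023-10-10 06:28:16 Completed - Training job completed
"""

# First field: time the line was printed (assumed client-side).
# Second field: status-change time (assumed to come from the job itself).
LINE_RE = re.compile(
    r"^(?P<printed>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<status_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<status>\w+) - (?P<message>.*?)\.*$"
)


def phase_durations(log_text):
    """Return (message, seconds until the next status line) pairs."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            when = datetime.strptime(match["status_time"], "%Y-%m-%d %H:%M:%S")
            events.append((when, match["message"]))
    # Each phase lasts until the next status line; the final line has no duration.
    return [
        (message, (next_when - when).total_seconds())
        for (when, message), (next_when, _) in zip(events, events[1:])
    ]


durations = phase_durations(LOG)
for message, seconds in durations:
    print(f"{seconds / 60:6.1f} min  {message}")
```

Under that assumption, the image download took about 7 minutes (05:11:47 to 05:18:44) and training about 69 minutes (05:18:44 to 06:27:30), which fits the idea that the lines were simply printed late. If you would rather not parse log text, the boto3 `describe_training_job` response includes a `SecondaryStatusTransitions` list with start and end times for each phase, which is probably a more direct source for the same numbers.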

Dave
answered 7 months ago
