Downloading SageMaker training image takes 1 hour

For the last few days, my training job times have blown out, and the logs show over 1 hour spent downloading the training image. I'm using spot instances for training - is this a symptom of that? It seems unlikely, because I'd assumed that if a spot instance wasn't available I'd get some other error, or at least the job wouldn't have started preparing the instances. I'm using the HuggingFace estimator with the following versions:

    transformers_version="4.28",  # Transformers version
    pytorch_version="2.0",  # PyTorch version
    py_version="py310",  # Python version
This is the log timeline for one of the affected jobs:

    16:10:41  2023-10-10 05:10:10 Starting - Starting the training job...
    16:11:41  2023-10-10 05:10:29 Starting - Preparing the instances for training......
    16:12:11  2023-10-10 05:11:26 Downloading - Downloading input data...
    16:19:14  2023-10-10 05:11:47 Training - Downloading the training image..........
    17:27:40  2023-10-10 05:18:44 Training - Training image download completed. Training in progress..........
    17:28:42  2023-10-10 06:27:30 Uploading - Uploading generated training model......
    17:28:42  2023-10-10 06:28:16 Completed - Training job completed
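
For context, here's roughly how the estimator and spot settings are configured - the entry point, role, instance type and S3 paths below are placeholders rather than my actual values:

    from sagemaker.huggingface import HuggingFace

    # Placeholder values: entry_point, source_dir, role, instance_type and the
    # S3 URIs are illustrative, not the real job configuration.
    huggingface_estimator = HuggingFace(
        entry_point="train.py",
        source_dir="./scripts",
        instance_type="ml.g5.xlarge",
        instance_count=1,
        role="arn:aws:iam::111111111111:role/SageMakerExecutionRole",
        transformers_version="4.28",  # Transformers version
        pytorch_version="2.0",        # PyTorch version
        py_version="py310",           # Python version
        # Spot training settings
        use_spot_instances=True,
        max_run=3600,                 # max training time, in seconds
        max_wait=7200,                # max wait for spot capacity + training; must be >= max_run
        checkpoint_s3_uri="s3://my-bucket/checkpoints/",
    )

    huggingface_estimator.fit({"train": "s3://my-bucket/train/"})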
Dave
Asked 7 months ago · Viewed 304 times

1 Answer

Actually, now that I look closer, I think the logs are "wrong" - the total time spent "training" was only 1 minute, yet it normally takes an hour to train with checkpoints, or a minimum of 10 minutes if I kick off the same job from existing checkpoints. So perhaps the log isn't being flushed correctly or something.
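
To check the real phase durations rather than the console's log times, describing the job shows the actual status transitions (the job name below is a placeholder):

    import boto3

    sm = boto3.client("sagemaker")

    # "my-training-job" is a placeholder - substitute the actual job name.
    desc = sm.describe_training_job(TrainingJobName="my-training-job")

    # Each transition records when the job actually entered/left a phase,
    # independently of when the console rendered the corresponding log line.
    for t in desc["SecondaryStatusTransitions"]:
        print(t["Status"], t["StartTime"], t.get("EndTime"), "-", t.get("StatusMessage"))

    print("TrainingTimeInSeconds:", desc.get("TrainingTimeInSeconds"))
    print("BillableTimeInSeconds:", desc.get("BillableTimeInSeconds"))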

Dave
Answered 7 months ago
