Downloading SageMaker training image takes 1 hour


For the last few days, my training jobs have been running far longer than expected, and the logs show over 1 hour spent downloading the training image. I'm using spot instances for training. Is this a symptom of that? It seems unlikely: I'd assumed that if a spot instance wasn't available I'd get some other error, or at least the job wouldn't have started preparing the instances. I'm using the HuggingFace estimator with the following settings, and the job log is shown after them:

        transformers_version="4.28",  # Transformers version
        pytorch_version="2.0",  # PyTorch version
        py_version="py310",  # Python version

16:10:41  2023-10-10 05:10:10 Starting - Starting the training job...
16:11:41  2023-10-10 05:10:29 Starting - Preparing the instances for training......
16:12:11  2023-10-10 05:11:26 Downloading - Downloading input data...
16:19:14  2023-10-10 05:11:47 Training - Downloading the training image..........................................
17:27:40  2023-10-10 05:18:44 Training - Training image download completed. Training in progress.........................................................................................................................................................................................................................................................................................................................................................................................................................
17:28:42  2023-10-10 06:27:30 Uploading - Uploading generated training model......
17:28:42  2023-10-10 06:28:16 Completed - Training job completed
Dave
asked 7 months ago, 305 views
1 Answer

Actually, now that I look closer, I think the logs are "wrong": the total time spent "training" was only 1 minute, yet it normally takes 1 hour to train with checkpoints, or a minimum of 10 minutes if I kick off the same job using checkpoints. So perhaps the log isn't being flushed correctly or something.
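
One way to check this is to compare the two timestamps on each log line. The sketch below is only an illustration: it assumes the first column is the time the line was printed locally and the second column is the status time reported by the service (neither is confirmed in this thread), and it copies the log lines from the question with the runs of dots shortened. The names `LOG`, `parse` and `durations` are made up for this example.

```python
import re
from datetime import datetime

# Log lines copied from the question; long runs of dots are shortened.
LOG = """\
16:10:41  2023-10-10 05:10:10 Starting - Starting the training job...
16:11:41  2023-10-10 05:10:29 Starting - Preparing the instances for training......
16:12:11  2023-10-10 05:11:26 Downloading - Downloading input data...
16:19:14  2023-10-10 05:11:47 Training - Downloading the training image......
16:27:40  2023-10-10 05:18:44 Training - Training image download completed. Training in progress......
17:28:42  2023-10-10 06:27:30 Uploading - Uploading generated training model......
17:28:42  2023-10-10 06:28:16 Completed - Training job completed
""".replace("16:27:40", "17:27:40")  # keep the original value from the question

LINE_RE = re.compile(
    r"^(?P<local>\d{2}:\d{2}:\d{2})\s+"                    # first column (assumed: local print time)
    r"(?P<status>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "    # second column (assumed: status time)
    r"(?P<phase>\w+) - (?P<message>[^.]+)"                 # phase and message up to the first dot
)


def parse(log):
    """Return a list of (local_time, status_time, message) tuples."""
    rows = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        rows.append((
            datetime.strptime(m["local"], "%H:%M:%S"),
            datetime.strptime(m["status"], "%Y-%m-%d %H:%M:%S"),
            m["message"].strip(),
        ))
    return rows


def durations(rows, column):
    """Seconds between each line and the next, using timestamp column 0 or 1."""
    return {
        cur[2]: int((nxt[column] - cur[column]).total_seconds())
        for cur, nxt in zip(rows, rows[1:])
    }


rows = parse(LOG)
by_local = durations(rows, 0)
by_status = durations(rows, 1)
for message in by_status:
    print(f"{message:40s} local={by_local[message]:5d}s  status={by_status[message]:5d}s")
```

Under those assumptions, the second column gives 417 seconds (about 7 minutes) for "Downloading the training image" and 4126 seconds (about 69 minutes) for the training phase, while the first column gives 4106 seconds and 62 seconds respectively. That would be consistent with the first-column times lagging the actual status changes rather than the image download itself taking an hour.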

Dave
answered 7 months ago
