SageMaker taking an unexpectedly long time to download training data


My customer's 220 GB of training data took 54 minutes for SageMaker to download. That is a rate of only about 70 MB/s, which is unexpectedly slow. He is accessing the data in S3 from his p3.8xlarge instance through a private VPC endpoint, so the theoretical maximum bandwidth is 25 Gbps. Is there anything that can be done to speed up the download?

He started the SageMaker training job with the following call:

estimator = Estimator(
    image_name,
    role=role,
    output_path=output_location,
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    train_volume_size=300,
    train_max_run=5*24*60*60,  # 5 days, in seconds
    security_group_ids='sg-00f1529adc4076841',
)

The output was:

2018-10-18 23:27:15 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-18 23:29:15 Downloading - Downloading input data............
....................................................................
....................................................................
....................................................................
2018-10-19 00:23:50 Training - Downloading the training image..

The dataset download took about 54 minutes.
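To make the numbers concrete, here is a rough sketch of the throughput arithmetic. It uses the two timestamps from the log above and the 220 GB and 25 Gbps figures quoted in the question. Decimal units (1 GB = 1000 MB) are an assumption, and the 25 Gbps value is taken as stated rather than verified.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the training job log in the question.
download_start = datetime.strptime("2018-10-18 23:29:15", FMT)
training_start = datetime.strptime("2018-10-19 00:23:50", FMT)

# Time spent in the "Downloading" phase, in seconds (54 min 35 s).
elapsed_s = (training_start - download_start).total_seconds()

# Dataset size, assuming decimal units (1 GB = 1000 MB).
dataset_mb = 220 * 1000

# Observed throughput in MB/s.
observed_mb_s = dataset_mb / elapsed_s

# Bandwidth quoted in the question, converted from Gbps to MB/s.
link_mb_s = 25 * 1000 / 8

print(f"elapsed:     {elapsed_s:.0f} s")
print(f"observed:    {observed_mb_s:.1f} MB/s")
print(f"link limit:  {link_mb_s:.0f} MB/s")
print(f"utilization: {100 * observed_mb_s / link_mb_s:.1f} %")
```

Under these assumptions the download used only a few percent of the quoted link capacity, so the limit is probably somewhere other than raw network bandwidth.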

AWS
Asked 6 years ago, viewed 1820 times
1 answer
Accepted answer

How are they connecting to S3? Are they using a VPC endpoint or a NAT? If they are using a VPC endpoint, my recommendation would be to open a support ticket; support may be able to look at the network logs.

Another option for the customer is to use Pipe input mode. Pipe mode is recommended for large datasets, and it will shorten their startup time because the data is streamed to the training instance instead of being downloaded in full before training starts.
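As a minimal sketch of what switching to Pipe mode could look like, the snippet below builds the keyword arguments for the estimator from the question with `input_mode="Pipe"` added, using the same SDK v1 parameter names. The helper names `pipe_mode_estimator_kwargs`, `pipe_path`, and `read_stream` and the channel name `training` are hypothetical. The training container also has to be written to read from the named pipe that SageMaker creates at `/opt/ml/input/data/<channel>_<epoch>`, which this sketch assumes.

```python
def pipe_mode_estimator_kwargs(image_name, role, output_path):
    """Keyword arguments for sagemaker.estimator.Estimator (SDK v1 names).

    Same settings as in the question, plus input_mode="Pipe" so that S3
    data is streamed to the container instead of downloaded up front.
    """
    return {
        "image_name": image_name,
        "role": role,
        "output_path": output_path,
        "train_instance_count": 1,
        "train_instance_type": "ml.p3.8xlarge",
        "train_volume_size": 300,
        "input_mode": "Pipe",
    }


def pipe_path(channel, epoch):
    """Path of the FIFO that SageMaker exposes for a channel in Pipe mode."""
    return f"/opt/ml/input/data/{channel}_{epoch}"


def read_stream(path, chunk_size=1 << 20):
    """Yield successive byte chunks from a pipe (or regular file) until EOF."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk


# With the sagemaker SDK installed, the estimator would be created roughly as:
#   from sagemaker.estimator import Estimator
#   estimator = Estimator(**pipe_mode_estimator_kwargs(image_name, role, output_location))
#   estimator.fit({"training": "s3://<bucket>/<prefix>"})  # hypothetical channel name
#
# Inside the container, the first epoch of that channel would be consumed with:
#   for chunk in read_stream(pipe_path("training", 0)):
#       ...  # parse records from chunk
```

Because each pipe is read sequentially, the training code cannot seek within the data, so this fits best when the training code already processes its input as a stream.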

Answered 5 years ago
