My customer's 220 GB of training data took 54 minutes for SageMaker to download, a rate of only ~70 MB/s, which is unexpectedly slow. He is accessing the data in S3 from his ml.p3.8xlarge instance through a private VPC endpoint, so the theoretical maximum bandwidth is 25 Gbps. Is there anything that can be done to speed up the download?
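For reference, a quick sanity check of the throughput numbers quoted above (decimal units assumed, i.e. 1 GB = 1000 MB):

```python
# Back-of-the-envelope check of the observed S3 download throughput.
dataset_mb = 220 * 1000           # 220 GB expressed in MB
download_seconds = 54 * 60        # 54 minutes
observed_mb_s = dataset_mb / download_seconds

link_gbps = 25                    # advertised bandwidth of a p3.8xlarge
link_mb_s = link_gbps * 1000 / 8  # Gbps -> MB/s

print(f"observed:  {observed_mb_s:.0f} MB/s")      # ~68 MB/s
print(f"link:      {link_mb_s:.0f} MB/s")          # 3125 MB/s
print(f"utilization: {observed_mb_s / link_mb_s:.1%}")
```

So the download is using roughly 2% of the available network bandwidth, which suggests the bottleneck is elsewhere (e.g. S3 GET concurrency or the input mode) rather than the link itself.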
He started the SageMaker training job with the following code:
from sagemaker.estimator import Estimator

estimator = Estimator(image_name,
                      role=role,
                      output_path=output_location,
                      train_instance_count=1,
                      train_instance_type='ml.p3.8xlarge',
                      train_volume_size=300,
                      train_max_run=52460*60,
                      security_group_ids=['sg-00f1529adc4076841'])
The output was:
2018-10-18 23:27:15 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-18 23:29:15 Downloading - Downloading input data............
....................................................................
2018-10-19 00:23:50 Training - Downloading the training image..
Dataset download took ~54 mins.