SageMaker taking an unexpectedly long time to download training data


My customer's 220 GB of training data took 54 minutes for SageMaker to download. This is a rate of only ~70 MB/s, which is unexpectedly slow. He is accessing the data in S3 from his p3.8xlarge instance through a private VPC endpoint, so the theoretical maximum bandwidth is 25 Gbps. Is there anything that can be done to speed up the download?
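For reference, the quoted rate can be sanity-checked with a quick back-of-envelope calculation (decimal units assumed: 1 GB = 1000 MB, 1 Gbps = 1000 Mbps):

```python
# Back-of-envelope check of the throughput figures quoted above.
dataset_gb = 220
download_seconds = 54 * 60  # 3240 s

observed_mb_s = dataset_gb * 1000 / download_seconds
print(f"observed: {observed_mb_s:.0f} MB/s")  # roughly 68 MB/s

link_gbps = 25
theoretical_mb_s = link_gbps * 1000 / 8
print(f"theoretical max: {theoretical_mb_s:.0f} MB/s")  # 3125 MB/s
```

So the download achieved roughly 2% of the instance's theoretical network bandwidth.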

He started the SageMaker training job with the following code (note that `security_group_ids` expects a list, not a bare string):

estimator = Estimator(
    image_name,
    role=role,
    output_path=output_location,
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    train_volume_size=300,
    train_max_run=52460 * 60,
    security_group_ids=['sg-00f1529adc4076841'],
)

The output was:

2018-10-18 23:27:15 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-18 23:29:15 Downloading - Downloading input data............
....................................................................
....................................................................
....................................................................
2018-10-19 00:23:50 Training - Downloading the training image..

The dataset download took ~54 minutes.

asked 4 years ago · 665 views
1 Answer
Accepted Answer

How are they connecting to S3? Are they using a VPC endpoint or a NAT gateway? If they are using a VPC endpoint, my recommendation is to open a support ticket; support may be able to examine the network logs.

Another option for the customer is to use Pipe input mode. Pipe mode is recommended for large datasets, and it will shorten startup time because the data is streamed to the training instances instead of being downloaded in full before training begins.
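As a sketch, switching to Pipe mode is a one-parameter change on the Estimator (this assumes the SageMaker Python SDK v1 parameter names used in the question, and that the training image is written to read its data from the channel FIFOs under /opt/ml/input/data/ rather than from local files):

```python
from sagemaker.estimator import Estimator

# Sketch only: image_name, role, and output_location are the same
# (undefined here) values used in the question's original call.
estimator = Estimator(
    image_name,
    role=role,
    output_path=output_location,
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    input_mode='Pipe',  # stream data from S3 instead of downloading it first
    train_max_run=52460 * 60,
    security_group_ids=['sg-00f1529adc4076841'],
)
```

Note that `train_volume_size` can also be reduced in Pipe mode, since the dataset is no longer staged on the training volume.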

answered 4 years ago
