SageMaker taking an unexpectedly long time to download training data

My customer's 220 GB of training data took 54 minutes for SageMaker to download. That is a rate of only about 70 MB/s, which is unexpectedly slow. He is accessing the data in S3 from his p3.8xlarge instance through a private VPC endpoint, so the theoretical maximum bandwidth is 25 Gbps. Is there anything that can be done to speed up the download?

He started the SageMaker training job with the following call:

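from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name,
    role=role,
    output_path=output_location,
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    train_volume_size=300,  # GB of storage attached to the training instance
    train_max_run=5*24*60*60,  # 5 days, in seconds
    security_group_ids=['sg-00f1529adc4076841'],  # the SDK expects a list of IDs
)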

The output was:

2018-10-18 23:27:15 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-18 23:29:15 Downloading - Downloading input data............
....................................................................
....................................................................
....................................................................
2018-10-19 00:23:50 Training - Downloading the training image..

The dataset download took about 54 minutes.
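As a sanity check on those numbers, here is a minimal sketch that derives the observed throughput from the two timestamps in the log and compares it with the 25 Gbps figure quoted above. It assumes the dataset is 220 decimal gigabytes and that the whole "Downloading" phase was spent transferring data; both are assumptions, not measurements.

```python
from datetime import datetime

# Timestamps copied from the job output above.
download_start = datetime(2018, 10, 18, 23, 29, 15)
download_end = datetime(2018, 10, 19, 0, 23, 50)

dataset_bytes = 220e9    # 220 GB, assuming decimal gigabytes
link_bits_per_s = 25e9   # 25 Gbps, the figure quoted in the question

# Wall-clock duration of the "Downloading" phase.
elapsed_s = (download_end - download_start).total_seconds()

# Convert both rates to MB/s so they can be compared directly.
observed_mb_per_s = dataset_bytes / elapsed_s / 1e6
link_mb_per_s = link_bits_per_s / 8 / 1e6
utilization = observed_mb_per_s / link_mb_per_s

print(f"elapsed:     {elapsed_s / 60:.1f} min")
print(f"observed:    {observed_mb_per_s:.0f} MB/s")
print(f"link limit:  {link_mb_per_s:.0f} MB/s")
print(f"utilization: {utilization:.1%}")
```

Under these assumptions the job used only a few percent of the quoted link capacity, which is why the download looks slow.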

AWS
Asked 5 years ago, 1815 views
1 Answer
Accepted Answer

How are they connecting to S3? Are they using a VPC endpoint or a NAT? If they are using a VPC endpoint, my recommendation is to open a support ticket; support may be able to look at the network logs.

Another option for the customer is to use Pipe input mode. Pipe mode is recommended for large datasets, and it will shorten their startup time because the data is streamed to the training instances instead of being downloaded in full before training starts.
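For illustration, here is a minimal sketch of how Pipe mode might be requested with the same Estimator settings as in the question. The helper names `build_estimator_kwargs` and `make_estimator` are hypothetical, and the parameter names follow the older SDK naming used in the question (newer SDK versions rename several of them). It also assumes the customer's training image can read its input as a stream rather than from files on disk, which Pipe mode requires.

```python
def build_estimator_kwargs(role, output_location):
    """Return the question's Estimator settings with Pipe input mode enabled."""
    return dict(
        role=role,
        output_path=output_location,
        train_instance_count=1,
        train_instance_type="ml.p3.8xlarge",
        train_volume_size=300,
        input_mode="Pipe",  # stream from S3 instead of downloading everything first
    )


def make_estimator(image_name, role, output_location):
    """Hypothetical helper: build the Estimator from the settings above."""
    # Imported here so the settings can be inspected without the SDK installed.
    from sagemaker.estimator import Estimator

    return Estimator(image_name, **build_estimator_kwargs(role, output_location))
```

Whether this helps in practice depends on the training code: if the algorithm needs random access to the full dataset, streaming may not be a drop-in replacement.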

Answered 5 years ago
