SageMaker taking an unexpectedly long time to download training data


My customer's 220 GB of training data took 54 minutes for SageMaker to download, a rate of only about 70 MB/s, which is unexpectedly slow. He is accessing the data in S3 from his ml.p3.8xlarge training instance through a private VPC endpoint, so the theoretical maximum bandwidth is 25 Gbps. Is there anything that can be done to speed up the download?
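For reference, a quick back-of-the-envelope check of the effective rate behind the ~70 MB/s figure (this assumes decimal GB/MB units, which is an assumption on my part):

data_gb = 220          # dataset size from above, in GB
minutes = 54           # observed download time

mb_per_s = data_gb * 1000 / (minutes * 60)   # ~68 MB/s
gbit_per_s = mb_per_s * 8 / 1000             # ~0.54 Gbit/s
print(f"{mb_per_s:.0f} MB/s (~{gbit_per_s:.2f} Gbit/s of the 25 Gbps link)")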

He started the SageMaker training job with the following estimator:

from sagemaker.estimator import Estimator

estimator = Estimator(image_name,
                      role=role,
                      output_path=output_location,
                      train_instance_count=1,
                      train_instance_type='ml.p3.8xlarge',
                      train_volume_size=300,
                      train_max_run=52460*60,
                      security_group_ids=['sg-00f1529adc4076841'])  # security_group_ids expects a list of IDs

The output was:

2018-10-18 23:27:15 Starting - Starting the training job...
Launching requested ML instances......
Preparing the instances for training...
2018-10-18 23:29:15 Downloading - Downloading input data............
2018-10-19 00:23:50 Training - Downloading the training image..

Dataset download took ~54 minutes.

AWS
Asked 5 years ago · 1,805 views
1 Answer
Accepted Answer

How are they connecting to S3? Are they using a VPC endpoint or a NAT gateway? If they are using a VPC endpoint, my recommendation would be to open a support ticket; it's possible that support will be able to look at the network logs.

Another option for the customer is to use Pipe input mode. Pipe mode is recommended for large datasets, and it will shorten their startup time because the data is streamed from S3 instead of being downloaded in full to the training instance before training starts. A sketch of what this could look like is shown below.
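For illustration only, here is a minimal sketch of Pipe mode using the same v1-style SDK calls as in the question. The S3 prefix and the 'train' channel name are placeholders, image_name, role and output_location are the variables from the question, and the VPC/security group settings are omitted, so adjust everything to the real setup:

from sagemaker.estimator import Estimator
from sagemaker.session import s3_input

estimator = Estimator(image_name,
                      role=role,
                      output_path=output_location,
                      train_instance_count=1,
                      train_instance_type='ml.p3.8xlarge',
                      train_volume_size=300,
                      input_mode='Pipe')  # stream the channel instead of copying it to the EBS volume first

# Placeholder S3 prefix; 'train' becomes the channel name, which the container
# reads as a FIFO at /opt/ml/input/data/train_0 rather than a directory of files.
train_channel = s3_input('s3://my-bucket/training-data/')

estimator.fit({'train': train_channel})

Note that the training code inside the container has to read the channel as a stream (for example RecordIO/TFRecord readers, or PipeModeDataset from the sagemaker_tensorflow package); code that expects File-mode directories will not work unchanged.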

Answered 5 years ago
