Confusion about Pipe mode when using the S3 shard key


Hi,

I am a little confused about whether the S3 shard key works in Pipe mode. Here is an example:

Assume I have:

2 instances, each with 4 workers;

Data: 8 files of 1 GB each (8 GB in total), spread across 4 different S3 paths, so each path holds 2 files (2 GB in total).

If I use Pipe mode with s3_input and distribution='ShardedByS3Key', and create 4 channels (each channel mapping to one S3 path with 2 files):

train_s3_input_1 = sagemaker.inputs.s3_input(channel_1, distribution='ShardedByS3Key')

Question:

How much data does each worker get to train on, 1 file or 2 files? Thanks.

AWS
asked 4 years ago, 303 views
1 Answer
Accepted Answer

Hi. When you specify ShardedByS3Key, SageMaker copies only a subset of the data to each ML compute instance launched for the training job, rather than replicating the full dataset. If n ML compute instances are launched, each instance gets approximately 1/n of the S3 objects. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.

To answer your question ("How much data does each worker get to train on, 1 file or 2 files?"): each instance gets 1 file from each training channel. With your 4 channels, that is 4 files (4 GB) per instance.
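As a rough sketch of the arithmetic for the setup in the question, the snippet below deals each channel's objects out across the instances. The round-robin order, the `shard_by_s3_key` helper, and the `path_N/file_N` object names are assumptions made for illustration only. SageMaker decides which object lands on which instance; only the "approximately 1/n per instance" split comes from the behavior described above.

```python
from collections import defaultdict


def shard_by_s3_key(channels, num_instances):
    """Illustrative only: split each channel's objects across instances.

    Round-robin assignment is an assumption; the real service only
    promises that each instance gets roughly 1/n of the objects.
    """
    assignment = {i: defaultdict(list) for i in range(num_instances)}
    for channel, keys in channels.items():
        for idx, key in enumerate(sorted(keys)):
            assignment[idx % num_instances][channel].append(key)
    return {i: dict(per_channel) for i, per_channel in assignment.items()}


# Setup from the question: 4 channels, 2 files (1 GB each) per channel,
# 2 instances. Object names are hypothetical placeholders.
channels = {
    f"channel_{c}": [f"path_{c}/file_{f}" for f in (1, 2)]
    for c in range(1, 5)
}

plan = shard_by_s3_key(channels, num_instances=2)

# Count how many files each instance receives across all channels.
files_per_instance = {
    i: sum(len(keys) for keys in per_channel.values())
    for i, per_channel in plan.items()
}
print(files_per_instance)  # {0: 4, 1: 4}
```

Under these assumptions each instance ends up with one file per channel. The split is made per instance, so how the 4 workers on an instance share those files is left to the training script.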

AWS
Will_B
answered 4 years ago
