confusion about PIPE mode when using S3 shard key

Question

Hi,

I am a little confused about whether the S3 shard key (ShardedByS3Key) works with Pipe mode. Here is an example:

Assume I have:

2 instances, each with 4 workers;

Data: 8 files of 1 GB each (8 GB in total), spread across 4 different S3 paths, so each path holds 2 files (2 GB in total).

Suppose I use Pipe mode with s3_input and distribution='ShardedByS3Key', and create 4 channels (each channel maps to one S3 path, i.e. 2 files):

train_s3_input_1 = sagemaker.inputs.s3_input(channel_1, distribution='ShardedByS3Key')

Question:

How much data does each worker get to train on, 1 file or 2 files? Thanks.

Accepted Answer

Hi,
When you specify *ShardedByS3Key*, SageMaker copies only a subset of the data to each ML compute instance launched for the training job, rather than the full dataset. If n ML compute instances are launched, each instance gets approximately 1/n of the S3 objects. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.
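
To make the 1/n split concrete, here is a minimal sketch in plain Python that applies it to the layout from the question. It does not call SageMaker. The round-robin assignment, the function name `shard_by_s3_key`, and the file names are assumptions for illustration only; which objects actually land on which instance is decided by SageMaker.

```python
def shard_by_s3_key(keys, n_instances):
    """Split a channel's S3 keys into n_instances shards.

    Round-robin over sorted keys is an assumption made for illustration;
    SageMaker only promises roughly 1/n of the objects per instance.
    """
    ordered = sorted(keys)
    return [ordered[i::n_instances] for i in range(n_instances)]


# Hypothetical layout from the question: 4 channels, 2 files of 1 GB each per channel.
channels = {
    f"channel_{c}": [f"path_{c}/file_{f}" for f in (1, 2)]
    for c in range(1, 5)
}

n_instances = 2
per_instance = {i: {} for i in range(n_instances)}
for name, keys in channels.items():
    for i, shard in enumerate(shard_by_s3_key(keys, n_instances)):
        per_instance[i][name] = shard

for i, assigned in per_instance.items():
    total_files = sum(len(files) for files in assigned.values())
    print(f"instance {i}: {total_files} files across {len(assigned)} channels")
```

Under these assumptions each instance ends up with 1 file per channel, so 4 of the 8 files overall.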

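To answer your question:
How much data does each worker get to train on, 1 file or 2 files? With 2 instances and 2 files per channel, each instance gets 1 file from each training channel. Note that the data is sharded per instance, not per worker; how the 4 workers on an instance divide that data depends on the training code.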
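
For reference, the four-channel setup described in the question corresponds roughly to the channel configuration below, written as plain Python dicts in the shape of the `InputDataConfig` parameter of the CreateTrainingJob API. This is only a sketch: the bucket name, prefixes, channel names, and the helper `make_channel` are hypothetical placeholders, not values from the original post.

```python
def make_channel(name, s3_uri):
    """Build one Pipe-mode channel entry that is sharded by S3 key."""
    return {
        "ChannelName": name,
        "InputMode": "Pipe",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }


# Hypothetical bucket and prefixes: one channel per S3 path, 2 files under each.
input_data_config = [
    make_channel(f"train_{i}", f"s3://example-bucket/data/path_{i}/")
    for i in range(1, 5)
]

for channel in input_data_config:
    source = channel["DataSource"]["S3DataSource"]
    print(channel["ChannelName"], source["S3Uri"], source["S3DataDistributionType"])
```

The distribution type is set on each channel's S3 data source, which is why the split is described per channel.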
