Confusion about Pipe mode when using the S3 shard key (ShardedByS3Key)


Hi,

I am a little confused about whether sharding by S3 key works in Pipe mode. Here is an example:

Assume I have:

2 instances, each with 4 workers;

Data: 8 files of 1 GB each (8 GB in total), spread across 4 different S3 paths, so each path holds 2 files (2 GB in total).

If I use Pipe mode with distribution='ShardedByS3Key' on the s3_input, and create 4 channels (each channel mapped to one S3 path with 2 files):

train_s3_input_1 = sagemaker.inputs.s3_input(channel_1, distribution='ShardedByS3Key')

Question:

How much data does each worker get to train on: 1 file or 2 files? Thanks.

AWS
asked 4 years ago · 233 views
1 Answer
Accepted Answer

Hi. When you specify ShardedByS3Key, SageMaker copies only a subset of the data to each ML compute instance launched for the training job, rather than the full dataset. If n ML compute instances are launched, each instance gets approximately 1/n of the S3 objects. This applies in both File and Pipe modes. Keep this in mind when developing algorithms.

To answer your question of whether each worker gets 1 file or 2 files: with 2 instances and 2 files per channel, each instance receives 1 file from each training channel.
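As a rough illustration of the 1/n split for the layout in the question, here is a minimal sketch in plain Python. It does not call SageMaker. The round-robin assignment, the helper names (`shard_by_s3_key`, `files_per_instance`), and the file names are assumptions made for illustration; SageMaker decides which object actually lands on which instance.

```python
from typing import Dict, List


def shard_by_s3_key(keys: List[str], instance_count: int) -> List[List[str]]:
    """Split S3 object keys across instances, roughly 1/n each.

    Round-robin over the sorted keys is an assumption for illustration only;
    the real assignment is made by SageMaker.
    """
    shards: List[List[str]] = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(keys)):
        shards[i % instance_count].append(key)
    return shards


def files_per_instance(
    channels: Dict[str, List[str]], instance_count: int
) -> List[Dict[str, List[str]]]:
    """Apply the per-channel split and collect the result per instance."""
    per_instance: List[Dict[str, List[str]]] = [
        {} for _ in range(instance_count)
    ]
    for name, keys in channels.items():
        for idx, shard in enumerate(shard_by_s3_key(keys, instance_count)):
            per_instance[idx][name] = shard
    return per_instance


# Hypothetical layout matching the question: 4 channels, 2 files of 1 GB each.
channels = {
    f"channel_{c}": [f"path_{c}/file_{f}.csv" for f in (1, 2)]
    for c in range(1, 5)
}

assignment = files_per_instance(channels, instance_count=2)
for idx, per_channel in enumerate(assignment):
    total = sum(len(files) for files in per_channel.values())
    print(f"instance {idx}: {total} files across {len(per_channel)} channels")
```

Under these assumptions each instance ends up with 1 file per channel, so 4 files (about 4 GB) across the four channels. The split is made per instance, not per worker, so how the 4 workers on an instance divide that data is left to the training code.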

AWS
Will_B
answered 4 years ago
