Do I have to re-download my dataset every time I run a SageMaker Estimator training job?


Hi, Over the coming weeks I'll be running some deep learning experiments using the PyTorch Sagemaker estimator, and I was wondering if it would be possible to avoid re-downloading my dataset every time I call

Is there a way to do this without using FastFile mode, i.e. by downloading the dataset once and re-using the same Docker image?

If it's not possible with online instances, would it be possible to re-use the Docker container if I were to run in local mode (i.e. instance_type='local_gpu')? If so, how?

And just to add, I am using S3 for the input data.

Many thanks, Tim

1 Answer

Hi Tim, a SageMaker training job needs to download or stream its data from S3. By default, the training job's input data config uses File mode, which means the data is downloaded from S3 at the start of every job.

We have launched a new mode called Fast File mode, which streams the data in while the job runs. If you are familiar with Pipe mode, Fast File mode is a combination of File mode and Pipe mode: it streams data to the training instance without requiring any code change. Please refer to the What's New announcement for details. To use Fast File mode, you simply change the configuration of your estimator accordingly.

Additionally, SageMaker training jobs support data sources other than S3. You can use EFS or FSx for Lustre to speed up training by eliminating the download step that File mode requires. The AWS blog post on this topic has more details.
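As a rough illustration of the estimator change described above, here is a minimal sketch using the SageMaker Python SDK's `PyTorch` estimator and its `input_mode` argument. The entry point name, instance type, framework version, role ARN, and S3 URI are all placeholder assumptions, not values from the question; adjust them to your own setup.

```python
from typing import Any, Dict

# Input modes accepted by the SageMaker estimator's `input_mode` argument.
VALID_INPUT_MODES = ("File", "Pipe", "FastFile")


def estimator_kwargs(role_arn: str, input_mode: str = "FastFile") -> Dict[str, Any]:
    """Build keyword arguments for a PyTorch estimator.

    Everything except `input_mode` is a placeholder (assumption) and
    should be replaced with the values from your own project.
    """
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {VALID_INPUT_MODES}")
    return {
        "entry_point": "train.py",          # hypothetical training script
        "role": role_arn,                   # your SageMaker execution role
        "instance_count": 1,
        "instance_type": "ml.g4dn.xlarge",  # assumed GPU instance type
        "framework_version": "1.13",        # assumed PyTorch version
        "py_version": "py39",
        "input_mode": input_mode,           # "File" is the default behaviour
    }


def launch_training(role_arn: str, s3_uri: str) -> None:
    """Start a training job that streams `s3_uri` via Fast File mode.

    Requires the `sagemaker` package and valid AWS credentials, so the
    import is kept local to this function.
    """
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(**estimator_kwargs(role_arn))
    # The channel name "training" is an assumption; use whatever channel
    # name your training script expects.
    estimator.fit({"training": s3_uri})
```

Calling `launch_training("arn:aws:iam::123456789012:role/ExampleRole", "s3://example-bucket/dataset/")` (both values hypothetical) would start a job with the data streamed rather than downloaded up front. Passing `input_mode="File"` to `estimator_kwargs` restores the default download behaviour.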

answered 2 years ago
