How to access the file system of a SageMaker notebook instance from outside that instance (i.e. via a Python SageMaker Estimator training call)



I have a large image dataset stored in the file system of a SageMaker notebook instance. I was hoping to learn how I could access this data from outside that notebook instance. I have done quite a bit of research but can't seem to find much; I am relatively new to this.

I want to be able to access the data in that notebook instance quickly, as I will be using it to train an AI model. Is there a recommended way to do this?

I originally uploaded the data to that notebook instance so that I could train a model within the instance, on the same file system. Note that it is a reasonably large dataset, which I had to preprocess within SageMaker.

What is the best way to store data when using the SageMaker estimators for training AI models?

Many thanks


asked 2 years ago, 3042 views
1 Answer
Accepted Answer

Hi Tim, when you create a SageMaker training job using an estimator, the general best practice is to store your data on S3. The training job then launches the instances requested in its configuration and reads the data from S3. We now also support fast file mode, which streams data from S3 on demand and so lets the training job start faster than file mode, which first downloads the data from S3 to the training instance.

But when you say you used a SageMaker notebook instance to train the model, I assume you were not using SageMaker training jobs but rather running a notebook (.ipynb) on the notebook instance. Please note that because SageMaker is a fully managed service, the notebook instance (like training instances, hosting instances, etc.) is launched in the service account, so you do not have direct access to those instances. A SageMaker notebook instance stores its data on an EBS volume mounted at /home/ec2-user/SageMaker, and you will not be able to access that EBS volume from outside the notebook instance. The volume can only be increased in size, not decreased. If you want a smaller volume, you need to create a new notebook instance with a smaller volume and move your data over from the previous instance via S3.

The general best practice is to store large datasets on S3 and keep only a small sample on the notebook instance, which also reduces the storage you need there. Use that sample to build and test your code. When you are ready to train on the whole dataset, launch a SageMaker training job that reads the full dataset from S3.

Note that training on the whole dataset on a notebook instance requires a large instance with enough computing power, and you cannot perform distributed training across multiple instances. By comparison, SageMaker training jobs give you more flexibility in choosing the instance type and let you run on multiple instances for distributed training. Lastly, once the training job is done, all of its resources are terminated, which saves cost compared with keeping a large notebook instance running. Hope this helps answer your question.
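As a rough illustration of that workflow, here is a minimal sketch using boto3 and the SageMaker Python SDK. It copies a folder from the notebook instance to S3, then starts a PyTorch training job that reads it in fast file mode. The bucket name, prefix, local folder, role ARN, `train.py` script, instance type, and framework version are all placeholder assumptions, not values from your setup, so adjust them before running.

```python
from pathlib import Path


def s3_uri(bucket: str, prefix: str) -> str:
    """Join a bucket name and key prefix into an s3:// URI."""
    return f"s3://{bucket}/{prefix.strip('/')}"


def upload_dataset(local_dir: str, bucket: str, prefix: str) -> str:
    """Copy every file under local_dir to S3. Run this on the notebook instance."""
    import boto3  # imported lazily so s3_uri() works without AWS libraries

    s3 = boto3.client("s3")
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keep the folder layout by reusing the relative path as the key.
            key = f"{prefix.strip('/')}/{path.relative_to(root).as_posix()}"
            s3.upload_file(str(path), bucket, key)
    return s3_uri(bucket, prefix)


def launch_training(train_uri: str, role_arn: str) -> None:
    """Start a training job that reads the dataset from S3 in fast file mode."""
    from sagemaker.inputs import TrainingInput
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",          # hypothetical training script
        role=role_arn,                   # your SageMaker execution role
        instance_count=1,
        instance_type="ml.g4dn.xlarge",  # assumed instance type
        framework_version="2.1",         # assumed PyTorch version
        py_version="py310",
    )
    # "training" is the channel name; FastFile avoids the up-front download.
    estimator.fit(
        {"training": TrainingInput(s3_data=train_uri, input_mode="FastFile")}
    )


# Example usage on the notebook instance (all values are placeholders):
# uri = upload_dataset("/home/ec2-user/SageMaker/images", "my-bucket", "images/train")
# launch_training(uri, "arn:aws:iam::123456789012:role/MySageMakerRole")
```

Inside the training container, a channel named `training` should appear under `/opt/ml/input/data/training`, and the SageMaker framework containers expose that path in the `SM_CHANNEL_TRAINING` environment variable, so the training script can read the images from there as ordinary files.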

answered 2 years ago
  • That's all great advice - thank you, appreciate it!

  • Hi Melanie, just wondering if you could let me know if there's a preferable way to create a SageMaker training job: should I go about it by using sagemaker.session.train(), or by creating an estimator (a PyTorch estimator in my case) and then calling

    Many thanks Tim
