
OSError: [Errno 28] No space left on device -- PyTorch, CNN, estimator


Hey there, I'm training a convolutional neural network (CNN) on a large dataset (10k images, 50 GB) stored in an S3 bucket, using a SageMaker estimator. Everything works when I use 2000 images, which total roughly 5-10 GB. However, when I increase the number of images to 3000 or more, I get an error saying that there is no space left on device. I've attached the estimator setup below. As you can see, I am using an ml.g4dn.12xlarge instance, which has 192 GB of memory, and I also increased the volume size to 900 GB, so I still don't know why I am getting a space/storage error. I know the error is related to the function "_get_train_data_loader", which is where the images are downloaded. I read somewhere that EFS (Elastic File System) might help with this issue; if so, I don't know how to specify it in the estimator.

    estimator = PyTorch(
        entry_point="pbdl_sm.py",
        role=role,
        framework_version="1.4.0",
        py_version="py3",
        instance_count=1,
        instance_type="ml.g4dn.12xlarge",
        volume_size=900,
        hyperparameters={"epochs": 6, "backend": "gloo", "lr": 0.001, "train_size": 2900, "n_realz": 3000},
    )

1 Answer

Hi

Thank you for reaching out to us.

In general, the "No space left on device" error occurs when disk utilization is high, that is, when a volume on the training instance fills up. Please review the instance metrics and the CloudWatch metrics and logs of the training job for more detailed information.
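To see which volume is filling up, the training script can log free space itself. The following is a minimal sketch using only the Python standard library. The path list is an assumption: /opt/ml/input/data is where SageMaker places input channels in File mode, while / and /tmp are included for comparison, and the helper names are hypothetical.

```python
import shutil
from pathlib import Path

# Assumed paths: adjust to wherever your script downloads or caches data.
CANDIDATE_PATHS = ["/", "/tmp", "/opt/ml/input/data"]


def disk_report(paths=CANDIDATE_PATHS):
    """Return {path: (total_gb, used_gb, free_gb)} for each path that exists."""
    report = {}
    for p in paths:
        if not Path(p).exists():
            continue  # skip paths that are absent, e.g. when run outside SageMaker
        usage = shutil.disk_usage(p)
        report[p] = tuple(
            round(value / 1024**3, 2)
            for value in (usage.total, usage.used, usage.free)
        )
    return report


def log_disk_report(paths=CANDIDATE_PATHS):
    """Print one line per path; a training job's printed output goes to its CloudWatch logs."""
    for path, (total, used, free) in disk_report(paths).items():
        print(f"[disk] {path}: total={total} GB used={used} GB free={free} GB")
```

Calling log_disk_report() before and after the download step in "_get_train_data_loader" should show whether the images land on the large training volume or on a smaller one.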

It can also depend on factors such as the learning rate, the number of epochs, and the configuration of the training job and the estimator. However, I would recommend trying distributed training with PyTorch on multiple instances using smdistributed [1]. Currently, the following are supported: distributed training with parameter servers, SageMaker Distributed (SMD) Data and Model Parallelism, and MPI. SMD Model Parallelism can only be used with MPI.

To enable SageMaker distributed data parallelism, pass the following as the estimator's distribution argument: { "smdistributed": { "dataparallel": { "enabled": True } } }
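As a rough sketch of where that setting goes, the helper below collects the keyword arguments for the PyTorch estimator, including the distribution argument. The function names are hypothetical, and the framework version, Python version, instance type and instance count are left as parameters on purpose: the data parallel library supports only certain GPU instance types and framework versions, so the ml.g4dn.12xlarge and 1.4.0 values from the question may not qualify. Please check the documentation in [1] before choosing them.

```python
def build_ddp_estimator_kwargs(role, framework_version, py_version,
                               instance_type, instance_count,
                               hyperparameters=None):
    """Collect keyword arguments for sagemaker.pytorch.PyTorch with SMD data parallelism."""
    return {
        "entry_point": "pbdl_sm.py",  # script name taken from the question
        "role": role,
        "framework_version": framework_version,
        "py_version": py_version,
        "instance_type": instance_type,
        "instance_count": instance_count,
        "hyperparameters": hyperparameters or {},
        # The setting from this answer, passed through the `distribution` argument.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


def make_estimator(estimator_kwargs):
    """Create the estimator; requires the SageMaker Python SDK to be installed."""
    # Imported lazily so the kwargs helper can be inspected without the SDK.
    from sagemaker.pytorch import PyTorch
    return PyTorch(**estimator_kwargs)
```

Building the arguments separately makes it easy to print and review the configuration before a job is launched.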

I would also recommend trying Pipe mode. With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means that your training jobs start sooner, finish quicker, and need less disk space [2].
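A minimal sketch of what Pipe mode involves on both sides is shown below. On the estimator, it is selected with the input_mode argument. In the training script, each channel is read as a stream from a FIFO named after the channel and the epoch number, rather than from downloaded files. The helper names are hypothetical, and the chunked reader only yields raw bytes: how to split the stream into images depends on how the dataset is stored, so "_get_train_data_loader" would need to be adapted accordingly (see [3] for a fuller example).

```python
def with_pipe_mode(estimator_kwargs):
    """Return a copy of the estimator keyword arguments with Pipe input mode selected."""
    updated = dict(estimator_kwargs)
    updated["input_mode"] = "Pipe"  # the default, "File", downloads the data first
    return updated


def pipe_fifo_path(channel, epoch, base_dir="/opt/ml/input/data"):
    """Path of the FIFO for a channel in Pipe mode; a new one is used for each epoch."""
    return f"{base_dir}/{channel}_{epoch}"


def read_in_chunks(path, chunk_size=1024 * 1024):
    """Yield raw byte chunks from the stream until it is exhausted."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

For a channel named "training", the first epoch would be read from pipe_fifo_path("training", 0). Because nothing is written to disk, the dataset size no longer has to fit on the training volume.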

If you still face issues that require further investigation, I would encourage you to open a case with Premium Support and include the training job ARN and the CloudWatch logs of the job. For security reasons, this post is not suitable for sharing customer resource details.

References:

[1] https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html

[2] https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/

[3] https://github.com/aws/amazon-sagemaker-examples/blob/80df7d61a4bf14a11f0442020e2003a7c1f78115/advanced_functionality/pipe_bring_your_own/train.py

[4] https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/pytorch/data_parallel/maskrcnn/pytorch_smdataparallel_maskrcnn_demo.html

answered a month ago
