OSError: [Errno 28] No space left on device -- PyTorch, CNN, estimator


Hi there, I'm training a convolutional neural network (CNN) on a large dataset (10k images, about 50 GB) stored in an S3 bucket, using a PyTorch estimator on SageMaker infrastructure. Everything works well when I use 2,000 images, which have a total size of roughly 5-10 GB. However, I get an error when I increase the number of images to 3,000 or more. The error says that there is no space left on the device.

The estimator setup is included below. As you can see, I am using an ml.g4dn.12xlarge instance, which has 192 GB of memory, and I also increased the volume size to 900 GB, so I still don't understand why I am getting a space/storage error. I know that the error is related to the function "_get_train_data_loader", which is where the images are downloaded.

I read somewhere that EFS (Elastic File System) might help with this issue. If so, I don't know how to specify it in the estimator.

    estimator = PyTorch(
        entry_point="pbdl_sm.py",
        role=role,
        framework_version="1.4.0",
        py_version="py3",
        instance_count=1,
        instance_type="ml.g4dn.12xlarge",
        volume_size=900,
        hyperparameters={
            "epochs": 6,
            "backend": "gloo",
            "lr": 0.001,
            "train_size": 2900,
            "n_realz": 3000,
        },
    )

Asked 2 years ago · 2279 views
2 Answers

Hi,

Thank you for reaching out to us.

In general, the "No space left on device" error occurs when disk utilization on the training instance is too high. Please review the instance metrics and the CloudWatch metrics and logs of the training job for more detailed information.
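One way to narrow this down from inside the training script is to log free disk space before and during data loading. The following is a minimal sketch, not part of the original setup: the helper name `disk_report` is hypothetical, and the paths are assumptions based on the usual SageMaker container layout, where input channels are placed under /opt/ml/input/data.

```python
import os
import shutil


def disk_report(paths):
    """Return total/used/free space in GB for each path (None if the path is missing)."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = None
            continue
        usage = shutil.disk_usage(path)
        report[path] = {
            "total_gb": usage.total / 1e9,
            "used_gb": usage.used / 1e9,
            "free_gb": usage.free / 1e9,
        }
    return report


# Assumed locations: the SageMaker input directory, the temp directory, and the root volume.
for path, usage in disk_report(["/opt/ml/input/data", "/tmp", "/"]).items():
    print(path, usage)
```

Calling this at the start of the data-loading function and again after every few hundred images shows which mount point is filling up, which the job's CloudWatch logs will then record.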

It also depends on factors such as the learning rate, the number of epochs, and the configuration of the training job and the estimator. However, I would recommend trying distributed training with PyTorch using smdistributed [1] on multiple instances. Currently, the following are supported: distributed training with parameter servers, SageMaker Distributed (SMD) Data and Model Parallelism, and MPI. SMD Model Parallelism can only be used with MPI.

To enable SageMaker distributed data parallelism, pass the following as the distribution argument of the estimator: { "smdistributed": { "dataparallel": { "enabled": True } } }
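As a rough sketch of where that dictionary goes, the estimator arguments could be assembled as below. The helper name `build_distributed_estimator_kwargs` is hypothetical, and the instance type, instance count, and framework version are placeholders: SMD data parallelism is only available on certain instance types and framework versions, so check the SDK documentation [1] before choosing them. Hyperparameters are omitted here because the communication backend used in the question may need to change.

```python
def build_distributed_estimator_kwargs(role, instance_type, instance_count, framework_version):
    """Assemble keyword arguments for a PyTorch estimator with SMD data parallelism enabled."""
    return {
        "entry_point": "pbdl_sm.py",  # training script from the question
        "role": role,
        "framework_version": framework_version,
        "py_version": "py3",
        "instance_count": instance_count,
        "instance_type": instance_type,
        # The dictionary from the answer is passed through the `distribution` argument.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


# Placeholder values: replace them with a role ARN and a supported instance type/version.
kwargs = build_distributed_estimator_kwargs(
    role="<your-execution-role-arn>",
    instance_type="<supported-instance-type>",
    instance_count=2,
    framework_version="<supported-framework-version>",
)

# With the SageMaker Python SDK installed, the estimator would then be created as:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(**kwargs)
```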

I would also recommend trying Pipe input mode. With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means that your training jobs start sooner, finish quicker, and need less disk space.
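A minimal sketch of switching the estimator from the question to Pipe mode follows. It relies on the estimator's `input_mode` argument; the helper names and the channel name "training" are assumptions for illustration. In Pipe mode the container exposes each channel as a named pipe whose path is the channel name followed by the epoch number, so the training script has to read that stream rather than list files in a directory.

```python
def build_pipe_mode_kwargs(role):
    """Estimator arguments from the question, with the input mode switched to Pipe."""
    return {
        "entry_point": "pbdl_sm.py",
        "role": role,
        "framework_version": "1.4.0",
        "py_version": "py3",
        "instance_count": 1,
        "instance_type": "ml.g4dn.12xlarge",
        "input_mode": "Pipe",  # stream from S3 instead of downloading first
    }


def pipe_path(channel, epoch, base_dir="/opt/ml/input/data"):
    """Path of the named pipe for a channel and epoch, e.g. .../training_0."""
    return f"{base_dir}/{channel}_{epoch}"


kwargs = build_pipe_mode_kwargs(role="<your-execution-role-arn>")
print(kwargs["input_mode"], pipe_path("training", 0))

# With the SageMaker Python SDK installed:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(**kwargs)
```

Because the data arrives as a byte stream, the data loader in the training script would need to be rewritten to parse records from the pipe.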

If you are still facing issues and need further investigation, I would encourage you to open a case with Premium Support and include the training job ARN and the CloudWatch logs of the job. For security reasons, this post is not a suitable place to share details of customer resources.

References:

[1] https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html

[2] https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/

[3] https://github.com/aws/amazon-sagemaker-examples/blob/80df7d61a4bf14a11f0442020e2003a7c1f78115/advanced_functionality/pipe_bring_your_own/train.py

[4] https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/pytorch/data_parallel/maskrcnn/pytorch_smdataparallel_maskrcnn_demo.html

AWS
Answered 2 years ago

Hi, we are facing a problem with our instance. It shows "OS errno 28 - No space left on device" even though we have space on our server. Can you please check and explain why this error is shown?

Manjeet
Answered 3 months ago
