Hi
Thank you for reaching out to us.
In general, the "No space left on device" error occurs when disk utilization on the training instance is high. Please review the instance metrics and the CloudWatch metrics and logs of the training job for more detailed information.
It also depends on factors such as the learning rate, the number of epochs, and the configuration of the training job and the estimator. However, I would recommend trying distributed training with PyTorch using smdistributed [1] on multiple instances. Currently, the following are supported: distributed training with parameter servers, SageMaker Distributed (SMD) Data and Model Parallelism, and MPI. SMD Model Parallelism can only be used with MPI.
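As a rough way to see disk utilization from inside the job, the training script can print disk usage so that it shows up in the job's CloudWatch logs. This is a minimal sketch, not an official diagnostic: the helper name `disk_report` is hypothetical, and the paths listed are the usual SageMaker container locations, which is an assumption about your setup.

```python
import os
import shutil

GIB = 1024 ** 3


def disk_report(paths):
    """Return {path: (used_gib, total_gib, percent_used)} for each path that exists."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            continue  # skip paths that are absent (e.g. when run outside a training container)
        usage = shutil.disk_usage(path)
        percent = 100.0 * usage.used / usage.total if usage.total else 0.0
        report[path] = (usage.used / GIB, usage.total / GIB, percent)
    return report


if __name__ == "__main__":
    # Assumed locations inside a SageMaker training container; adjust as needed.
    candidate_paths = ["/opt/ml/input/data", "/opt/ml/model", "/tmp"]
    for path, (used, total, pct) in disk_report(candidate_paths).items():
        print(f"{path}: {used:.1f} / {total:.1f} GiB used ({pct:.0f}%)")
```

Calling this at the start of training and again after each epoch or checkpoint should show which location is filling up.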
To enable SageMaker distributed data parallelism, pass the following as the estimator's distribution argument:
{ "smdistributed": { "dataparallel": { "enabled": True } } }
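For context, here is a sketch of where that dictionary goes when building a PyTorch estimator with the SageMaker Python SDK [1]. The helper names, the script name, the role ARN, the S3 URI, the instance type, and the framework and Python versions are all placeholders that I am assuming for illustration. Please check which instance types and versions smdistributed supports in your region before using them.

```python
def pytorch_estimator_kwargs(entry_point, role,
                             instance_count=2,
                             instance_type="ml.p3.16xlarge",
                             framework_version="1.8.1",
                             py_version="py36"):
    """Build keyword arguments for sagemaker.pytorch.PyTorch with SMD data parallelism on."""
    return {
        "entry_point": entry_point,              # your training script (placeholder)
        "role": role,                            # IAM role ARN (placeholder)
        "instance_count": instance_count,        # multiple instances, as suggested above
        "instance_type": instance_type,          # assumed value; verify smdistributed support
        "framework_version": framework_version,  # assumed value
        "py_version": py_version,                # assumed value
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


def launch_training(kwargs, s3_training_uri):
    """Start the job. Needs the SageMaker Python SDK and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily so the helper above works without the SDK

    estimator = PyTorch(**kwargs)
    estimator.fit({"training": s3_training_uri})
    return estimator
```

For example, `launch_training(pytorch_estimator_kwargs("train.py", "<your-role-arn>"), "<your-s3-uri>")` would start a two-instance job, with both angle-bracket values replaced by your own.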
I would also recommend trying Pipe mode [2]. With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means that your training jobs start sooner, finish quicker, and need less disk space.
If you are still facing issues that require further investigation, I would encourage you to open a case with Premium Support and include the training job ARN and the CloudWatch logs of the job. For security reasons, this post is not a suitable place to share customer resource details.
References:
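A minimal sketch of requesting Pipe input mode for one channel with the SageMaker Python SDK follows. The helper names, the channel name "training", and the S3 URI are placeholders I am assuming. Note also that with Pipe mode the training script has to read from the stream rather than from downloaded files, so the script may need changes.

```python
def pipe_channel_kwargs(s3_uri):
    """Build keyword arguments for sagemaker.inputs.TrainingInput using Pipe input mode."""
    return {
        "s3_data": s3_uri,     # S3 prefix holding the dataset (placeholder)
        "input_mode": "Pipe",  # stream the data instead of downloading it to disk first
    }


def fit_with_pipe_mode(estimator, s3_uri, channel="training"):
    """Start training with a Pipe-mode channel. Needs the SageMaker Python SDK."""
    from sagemaker.inputs import TrainingInput  # imported lazily so the helper above works without the SDK

    estimator.fit({channel: TrainingInput(**pipe_channel_kwargs(s3_uri))})
```

Here `estimator` is whatever estimator object you already create for the job. Only the input channel changes.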
[1] https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html
[2] https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
Hi, we are facing a problem with our instance. It shows "OS errno 28 - No space left on device" even though we have space on our server. Can you please check and tell us why this error is shown?