No space left on device - Error during training a 6.8 GB dataset on llama 3.1 8B instruct

I have lost around $1,000 trying different variations, and I would be grateful for any ideas on how to solve this. As you can see in the image, I have increased the storage to 1024 GB, which is more than enough for a 6.8 GB training dataset.

This is the instance on which the model is being trained.

1 Answer

I'm sorry for the frustration this error is causing. A "No space left on device" error is usually not about the total storage attached to your instance, but about where the training process actually writes its files.

This error typically occurs when one specific filesystem or partition runs out of space, even if plenty of storage is available elsewhere on the system. A likely cause in your case is that the root filesystem (/) is filling up during training while the additional storage you provisioned is mounted at a different path.
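
To illustrate the point about individual filesystems, here is a minimal Python sketch that reports free space and free inodes for the filesystem holding a given path. It is only a sketch under stated assumptions: the function name `filesystem_report` and the candidate paths are hypothetical, and `os.statvfs` is available on Unix-like systems only. Inodes are included because a filesystem that has run out of inodes can also report "No space left on device" while still showing free bytes.

```python
import os

def filesystem_report(path="/"):
    """Return space and inode figures for the filesystem that holds `path`."""
    st = os.statvfs(path)
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize  # space available to non-root users
    return {
        "path": path,
        # Paths that report the same device id are on the same filesystem.
        "device_id": os.stat(path).st_dev,
        "total_gib": total_bytes / 2**30,
        "free_gib": free_bytes / 2**30,
        "inodes_total": st.f_files,
        "inodes_free": st.f_favail,
    }

if __name__ == "__main__":
    # Hypothetical candidates; replace with the directories your job writes to.
    for p in ["/", "/tmp", os.path.expanduser("~")]:
        if os.path.exists(p):
            r = filesystem_report(p)
            print(
                f"{r['path']:<24} dev {r['device_id']:<8} "
                f"free {r['free_gib']:.1f} of {r['total_gib']:.1f} GiB, "
                f"inodes free {r['inodes_free']} of {r['inodes_total']}"
            )
```

If two of the listed paths show the same device id and the same totals, they share one filesystem, so moving files from one to the other will not free up any space.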

Here are a few potential solutions to consider:

  1. Check disk usage: Run "df -h" to see which partitions are full. It's possible that temporary files or logs are filling up a specific partition.

  2. Clean up unnecessary files: Remove any large, unnecessary files or old Docker images/containers that might be taking up space.

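  3. Adjust partition sizes: If the root volume itself is too small, increase its size and then extend the partition and the filesystem on it (for example with growpart and resize2fs or xfs_growfs on Linux). Enlarging the volume alone does not make the extra space usable.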

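  4. Use a different directory: Try changing the working directory for your training process to a path on the filesystem with the most available space. Note that /home and /tmp are often part of the root filesystem, so confirm with "df -h" where a directory is actually mounted before moving to it.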

  5. Monitor disk usage during training: Add logging to track disk usage as the training progresses to identify where the space is being consumed.

  6. Optimize your training pipeline: Look for ways to reduce disk usage in your training process, such as processing data in smaller batches or cleaning up temporary files more aggressively.

  7. Consider using Amazon EFS: If the issue persists, you might want to consider using Amazon Elastic File System (EFS) for your training data and output, which provides scalable storage separate from your instance's local storage.
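
Steps 4 and 5 above can be sketched together in Python. This is a rough illustration, not a definitive setup: the helper names `pick_roomiest` and `start_disk_monitor` and the candidate directories are hypothetical, and it assumes your training code and its tools honor the `TMPDIR` environment variable for temporary files. Frameworks often have their own cache and checkpoint settings that need to be pointed at the larger volume separately.

```python
import os
import shutil
import tempfile
import threading

def pick_roomiest(candidates):
    """Return the existing candidate directory whose filesystem has the most free space."""
    existing = [d for d in candidates if os.path.isdir(d)]
    if not existing:
        raise FileNotFoundError("none of the candidate directories exist")
    return max(existing, key=lambda d: shutil.disk_usage(d).free)

def start_disk_monitor(paths, interval_s=60.0, log=print):
    """Log free space for each path every `interval_s` seconds.

    Returns a threading.Event; call .set() on it to stop the monitor.
    """
    stop = threading.Event()

    def loop():
        while True:
            for p in paths:
                u = shutil.disk_usage(p)
                log(f"[disk] {p}: {u.free / 2**30:.1f} GiB free of {u.total / 2**30:.1f} GiB")
            if stop.wait(interval_s):  # returns True once stop.set() is called
                break

    threading.Thread(target=loop, daemon=True).start()
    return stop

if __name__ == "__main__":
    # Hypothetical candidates; use the mount points that `df -h` shows on your instance.
    work_dir = pick_roomiest(["/tmp", os.path.expanduser("~")])
    scratch = tempfile.mkdtemp(prefix="train-scratch-", dir=work_dir)
    os.environ["TMPDIR"] = scratch  # inherited by child processes started from here
    tempfile.tempdir = None         # make Python's tempfile re-read TMPDIR
    stop = start_disk_monitor(["/", scratch], interval_s=60.0)
    try:
        pass  # run the training job here
    finally:
        stop.set()
        shutil.rmtree(scratch, ignore_errors=True)  # clean up temporary files
```

Watching the `[disk]` lines for "/" next to those for the scratch directory should show which filesystem is shrinking as training progresses, which narrows down where the space is going.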

If none of these solutions work, you may need to reach out to AWS support for more detailed troubleshooting specific to your setup and the Llama 3.1 8B model training process.

answered 12 days ago
