Hi, I tried following the blog and notebook linked above, and got a "no space left on device" error with a ml.g5.12xlarge instance. What should I do?
Hello,
I understand that you encountered an OOM error while fine-tuning Falcon-40B on SageMaker using the following instances: P3 16xlarge, 24xlarge, and 12xlarge.
In the following blog post [1] and notebook example [2], a ml.g5.12xlarge instance was used to fine-tune Falcon-40B; could you kindly try with the same instance, or choose a larger instance from the ml.g5 family [3]. For larger models, kindly try using ml.p4d, ml.p4de, and ml.inf1 instances.
To request a service quota increase for instances, open the AWS Service Quotas console, navigate to AWS services > Amazon SageMaker, and select "Studio KernelGateway Apps running on ml.g5.12xlarge instances".
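As a rough sanity check (my own back-of-the-envelope arithmetic, not numbers from the blog post), you can estimate why Falcon-40B only fits on a ml.g5.12xlarge (4 × 24 GiB A10G GPUs, ~96 GiB aggregate) once the weights are quantized to 4 bits, which is what the QLoRA notebook does:

```python
# Back-of-the-envelope GPU memory needed for model weights alone.
# This ignores activations, optimizer state, and the KV cache,
# so real usage is noticeably higher than these figures.

def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

FALCON_40B = 40e9
G5_12XLARGE_GPU_GIB = 4 * 24  # 4 x A10G, 24 GiB each (aggregate)

fp16_gib = weight_gib(FALCON_40B, 2.0)      # ~74.5 GiB: barely fits, no headroom
four_bit_gib = weight_gib(FALCON_40B, 0.5)  # ~18.6 GiB: leaves room for adapters

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {four_bit_gib:.1f} GiB (of {G5_12XLARGE_GPU_GIB} GiB total)")
```

If your run adds overhead beyond this (larger batches, longer sequences, extra checkpoints written to local disk), that margin disappears quickly, which is consistent with the "no space left on device" symptom.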
Reference
[1] https://aws.amazon.com/blogs/machine-learning/interactively-fine-tune-falcon-40b-and-other-llms-on-amazon-sagemaker-studio-notebooks-using-qlora/
[2] https://github.com/aws-samples/amazon-sagemaker-generativeai/blob/main/studio-notebook-fine-tuning/falcon-40b-qlora-finetune-summarize.ipynb
[3] https://aws.amazon.com/sagemaker/pricing/
I would suggest creating a Support case, because that way the Support engineer can look into your specific issue in a more fine-grained manner. In general, there are multiple reasons why an OOM error might occur.
Try partitioning the model more granularly and use a better checkpointing strategy (e.g., gradient checkpointing). Also consider techniques like sharded data parallelism.
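To make that concrete, here is a sketch of settings you might pass to a SageMaker PyTorch estimator to enable sharded data parallelism and gradient checkpointing. The degree, hyperparameter names, and values below are illustrative assumptions to adapt to your own training script, not values taken from the blog post:

```python
# Sketch: dicts you might pass as the `distribution` and `hyperparameters`
# arguments of a sagemaker.pytorch.PyTorch estimator. Values are assumptions.

# SageMaker's model-parallel library can shard optimizer state and gradients
# across GPUs ("sharded data parallelism"), cutting per-GPU memory use.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                # Shard across all 4 GPUs of a ml.g5.12xlarge (assumed degree).
                "sharded_data_parallel_degree": 4,
            },
        }
    },
}

# Hypothetical script hyperparameters: turn on gradient (activation)
# checkpointing and keep per-device batches small to avoid OOM.
hyperparameters = {
    "gradient_checkpointing": True,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
}
```

Gradient checkpointing trades extra compute for memory by recomputing activations during the backward pass, and small per-device batches with gradient accumulation preserve the effective batch size without the memory cost.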
If all of this fails, submit a support ticket so the team can look into it.