AWS Batch w/ SPOT instances: CannotPullContainerError due to "no space left on device"

I am using AWS Batch SPOT instances for workloads that require GPUs. Every now and then, I receive an error like the following:

CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /usr/local/lib/python3.8/dist-packages/nvidia/cublas/lib/libcublasLt.so.11: no space left on device

My Job Definition specifies that I need a GPU, and it pulls a custom Docker image. These images are large, around 6 GB each. How do I prevent this from happening? It does not appear that I can specify a minimum disk space requirement in the Job Definition.

I have seen some questions and answers like the following, but I (ignorantly) do not believe they are relevant for SPOT instances. Perhaps I am mistaken? https://repost.aws/questions/QUx6Ix1R1SSNisYSs1Sw8EBA/cannotpullcontainererror-no-space-left-on-device

jag
asked 1 year ago · 277 views
1 Answer

https://aws.amazon.com/getting-started/hands-on/run-batch-jobs-at-scale-with-ec2-spot/

Have a look at the link above. At step 1.6 you should be able to select a preferred instance type, since your Docker image is very large. Also check the instance's default EBS volume size.
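
On the EBS volume size point, one common approach is to attach an EC2 launch template with a larger root volume to the Batch compute environment. The following is only a minimal sketch, not a confirmed fix for this case. It assumes the compute environment uses an ECS-optimized Amazon Linux 2 AMI (root device `/dev/xvda`, 30 GiB by default), and the helper names `build_launch_template_data` and `create_launch_template`, the 100 GiB size, and the template name are illustrative choices, not values from the tutorial.

```python
# Sketch: build launch template data that enlarges the root EBS volume.
# Assumptions (not from the original thread): ECS-optimized Amazon Linux 2
# AMI, root device /dev/xvda, default root volume of 30 GiB.

DEFAULT_ROOT_GIB = 30  # assumed default size for the ECS-optimized AL2 AMI


def build_launch_template_data(root_volume_gib: int = 100,
                               device_name: str = "/dev/xvda") -> dict:
    """Return LaunchTemplateData with a resized root EBS volume."""
    if root_volume_gib < DEFAULT_ROOT_GIB:
        raise ValueError(
            f"root volume should be at least {DEFAULT_ROOT_GIB} GiB"
        )
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": root_volume_gib,  # size in GiB
                    "VolumeType": "gp3",
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


def create_launch_template(name: str, root_volume_gib: int = 100) -> str:
    """Create the launch template in EC2 and return its ID.

    Requires AWS credentials and the boto3 package.
    """
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=build_launch_template_data(root_volume_gib),
    )
    return response["LaunchTemplate"]["LaunchTemplateId"]


if __name__ == "__main__":
    # Hypothetical template name; prints the payload without calling AWS.
    print(build_launch_template_data(100))
```

The resulting template would then be referenced from the compute environment's `computeResources.launchTemplate` setting (for example by name with version `$Latest`). This applies to Spot and On-Demand compute environments alike, since the disk size comes from the instance's block device mapping rather than from the purchasing option.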

"You can add additional instance families or specific types, including GPU instance families (P2 and P3) if your jobs require it; AWS Batch will pick the optimal set of instances based on your jobs’ requirements, while benefiting from the 90% discount Spot Instances offer over On-Demand pricing."

AWS
Rachel
answered 1 year ago
