AWS Batch w/ SPOT instances: CannotPullContainerError due to "no space left on device"

I am using AWS Batch SPOT instances for workloads that require GPUs. Every now and then, I receive an error like the following:

CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /usr/local/lib/python3.8/dist-packages/nvidia/cublas/lib/libcublasLt.so.11: no space left on device

My Job Definition specifies that I need a GPU, and it pulls a custom Docker image. These images are large, around 6 GB. How do I prevent this from happening? It does not appear that I can specify a minimum disk space requirement in the Job Definition.

I have seen some questions and answers like the following, but I (perhaps ignorantly) do not believe they are relevant for SPOT instances. Perhaps I am mistaken? https://repost.aws/questions/QUx6Ix1R1SSNisYSs1Sw8EBA/cannotpullcontainererror-no-space-left-on-device

1 Answer

https://aws.amazon.com/getting-started/hands-on/run-batch-jobs-at-scale-with-ec2-spot/

Have a look at the link above. At step 1.6 you should be able to select your preferred instance type, since your Docker image is very large. Also check the instance's default EBS volume size.
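On the EBS volume size point, one common approach is to attach an EC2 launch template with a larger root volume to the Batch compute environment. The following is only a rough sketch, not a tested setup. It assumes an ECS-optimized Amazon Linux 2 AMI, where the root device is usually `/dev/xvda`, and the 100 GiB size, the `gp3` volume type, and the template name are placeholders chosen for illustration. Check the device name against the AMI you actually use.

```python
def build_launch_template_data(volume_size_gib=100, device_name="/dev/xvda"):
    """Build LaunchTemplateData that enlarges the instance's root EBS volume.

    Assumptions (not from the original question): the AMI's root device is
    /dev/xvda and Docker stores image layers on that root volume.
    """
    if volume_size_gib <= 0:
        raise ValueError("volume_size_gib must be a positive number of GiB")
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": volume_size_gib,  # size in GiB
                    "VolumeType": "gp3",
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


def create_launch_template(name, volume_size_gib=100):
    """Create the launch template in EC2 and return its ID.

    Requires AWS credentials with permission to create launch templates.
    """
    import boto3  # imported here so the helper above works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=build_launch_template_data(volume_size_gib),
    )
    return response["LaunchTemplate"]["LaunchTemplateId"]


if __name__ == "__main__":
    # Print the payload only; call create_launch_template("my-batch-gpu-lt")
    # (a hypothetical name) to actually create it in your account.
    print(build_launch_template_data(100))
```

The resulting template can then be referenced in the compute environment's `computeResources.launchTemplate` setting. As far as I know this works the same way for SPOT and On-Demand compute environments, and it may be simplest to create a new compute environment rather than modify an existing one.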

"You can add additional instance families or specific types, including GPU instance families (P2 and P3) if your jobs require it; AWS Batch will pick the optimal set of instances based on your jobs’ requirements, while benefiting from the 90% discount Spot Instances offer over On-Demand pricing."

AWS
Rachel
answered 1 year ago
