AWS Batch w/ SPOT instances: CannotPullContainerError due to "no space left on device"

I am using AWS Batch SPOT instances for workloads that require GPUs. Every now and then, I receive an error like the following:

CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /usr/local/lib/python3.8/dist-packages/nvidia/cublas/lib/libcublasLt.so.11: no space left on device

My Job Definition specifies that I need a GPU, and it pulls a custom Docker image. These images are large, right around 6 GB. How do I prevent this from happening? It does not appear that I can specify a minimum disk space requirement in the Job Definition.

I have seen some questions and answers like the following, but I (perhaps ignorantly) do not believe they are relevant to SPOT instances. Perhaps I am mistaken? https://repost.aws/questions/QUx6Ix1R1SSNisYSs1Sw8EBA/cannotpullcontainererror-no-space-left-on-device

1 Answer

https://aws.amazon.com/getting-started/hands-on/run-batch-jobs-at-scale-with-ec2-spot/

Have a look at the link above. At step 1.6 you should be able to select a preferred instance type, which matters because your Docker image is very large. Also check the default EBS volume size of the instance.

"You can add additional instance families or specific types, including GPU instance families (P2 and P3) if your jobs require it; AWS Batch will pick the optimal set of instances based on your jobs’ requirements, while benefiting from the 90% discount Spot Instances offer over On-Demand pricing."
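If the default EBS volume turns out to be too small for a 6 GB image, one common approach is to attach an EC2 launch template with a larger root volume to the Batch compute environment (the `launchTemplate` field under `computeResources`). The following is a minimal sketch, not a verified fix for this case. It assumes the AMI's root device is `/dev/xvda` (check your AMI), and the 100 GiB size, the `gp3` volume type, and the template name `batch-gpu-large-disk` are placeholder values chosen for illustration.

```python
import json


def build_launch_template_data(root_volume_gib=100, device_name="/dev/xvda"):
    """Build LaunchTemplateData that enlarges the instance's root EBS volume.

    Assumptions (not from the original question): the AMI's root device is
    /dev/xvda, and 100 GiB of gp3 storage is enough for the image layers.
    """
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": root_volume_gib,  # size in GiB
                    "VolumeType": "gp3",
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


def create_launch_template(name, data):
    """Create the launch template in EC2. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=data,
    )


if __name__ == "__main__":
    # Print the template body for review before creating anything.
    print(json.dumps(build_launch_template_data(), indent=2))
    # Hypothetical template name; uncomment to actually create it:
    # create_launch_template("batch-gpu-large-disk", build_launch_template_data())
```

The resulting template name can then be referenced from the compute environment's `computeResources.launchTemplate` setting, so that Spot and On-Demand instances launched by that environment both get the larger disk.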

AWS
Rachel
answered a year ago
