AWS Batch w/ SPOT instances: CannotPullContainerError due to "no space left on device"

I am using AWS Batch SPOT instances for workloads that require GPUs. Every now and then, I receive an error like the following:

CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /usr/local/lib/python3.8/dist-packages/nvidia/cublas/lib/libcublasLt.so.11: no space left on device

My Job Definition specifies that I need a GPU, and it pulls down a custom Docker image. These images are large, right around 6 GB. How do I prevent this from happening? It does not appear that I can specify a minimum disk space requirement in the Job Definition.

I have seen some questions and answers like the following, but I (perhaps ignorantly) do not believe it is relevant for SPOT instances. Am I mistaken? https://repost.aws/questions/QUx6Ix1R1SSNisYSs1Sw8EBA/cannotpullcontainererror-no-space-left-on-device

1 Answer

https://aws.amazon.com/getting-started/hands-on/run-batch-jobs-at-scale-with-ec2-spot/

Have a look at the link above. At step 1.6 you should be able to select a preferred instance type, since your Docker image is very large. Also check the instance's default EBS volume size.

"You can add additional instance families or specific types, including GPU instance families (P2 and P3) if your jobs require it; AWS Batch will pick the optimal set of instances based on your jobs’ requirements, while benefiting from the 90% discount Spot Instances offer over On-Demand pricing."
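If the default EBS volume turns out to be too small for a 6 GB image, one common approach is to give the compute environment an EC2 launch template with a larger root volume. Below is a minimal sketch using boto3, not a definitive setup. It assumes an Amazon Linux 2 based ECS-optimized AMI, where Docker stores image layers on the root device `/dev/xvda`; the template name, the 100 GiB size, and the `gp3` volume type are illustrative values, so check them against your own AMI and account.

```python
def build_launch_template_data(volume_size_gib, device_name="/dev/xvda", volume_type="gp3"):
    """Build LaunchTemplateData that enlarges the root EBS volume.

    Assumption: device_name matches the AMI's root device. "/dev/xvda" is
    the root device on Amazon Linux 2 ECS-optimized AMIs; other AMIs may differ.
    """
    if volume_size_gib <= 0:
        raise ValueError("volume_size_gib must be positive")
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": volume_size_gib,  # size in GiB
                    "VolumeType": volume_type,
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


def create_launch_template(name, volume_size_gib):
    """Create the launch template in EC2 and return its ID.

    boto3 is imported here so the builder above can be used without it.
    Requires AWS credentials and a default region to be configured.
    """
    import boto3

    ec2 = boto3.client("ec2")
    response = ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=build_launch_template_data(volume_size_gib),
    )
    return response["LaunchTemplate"]["LaunchTemplateId"]


if __name__ == "__main__":
    # Hypothetical name and size, chosen only for illustration.
    print(build_launch_template_data(100))
```

The resulting launch template can then be referenced from the `launchTemplate` field of the compute environment's `computeResources`. Launch templates are not specific to On-Demand capacity, so this should apply to a SPOT compute environment as well.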

AWS
Rachel
answered a year ago
