AWS Batch w/ SPOT instances: CannotPullContainerError due to "no space left on device"


I am using AWS Batch SPOT instances for workloads that require GPUs. Every now and then, I receive an error like the following:

CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /usr/local/lib/python3.8/dist-packages/nvidia/cublas/lib/libcublasLt.so.11: no space left on device

My Job Definition specifies that I need a GPU, and it pulls a custom Docker image. These images are large, around 6 GB each. How do I prevent this from happening? It does not appear that I can specify a minimum disk space requirement in the Job Definition.

I have seen some questions and answers like the following, but I (perhaps ignorantly) do not believe they are relevant for SPOT instances. Perhaps I am mistaken? https://repost.aws/questions/QUx6Ix1R1SSNisYSs1Sw8EBA/cannotpullcontainererror-no-space-left-on-device

1 Answer

https://aws.amazon.com/getting-started/hands-on/run-batch-jobs-at-scale-with-ec2-spot/

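Have a look at the link above. At step 1.6 you should be able to select a preferred instance type, which matters because your Docker image is very large. Also check the instance's default EBS volume size: the image layers are extracted onto that volume when the container is pulled, so it needs enough free space for the unpacked image.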

"You can add additional instance families or specific types, including GPU instance families (P2 and P3) if your jobs require it; AWS Batch will pick the optimal set of instances based on your jobs’ requirements, while benefiting from the 90% discount Spot Instances offer over On-Demand pricing."
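If the default EBS volume turns out to be too small, one common approach is to attach an EC2 launch template with a larger root volume to the Batch compute environment. The following is only a minimal sketch under stated assumptions, not a confirmed fix for this case: it assumes the compute environment uses an ECS-optimized Amazon Linux 2 AMI whose root device is /dev/xvda, that 100 GiB is enough for the image, and that boto3 is installed and credentials are configured. The function names and the template name are hypothetical.

```python
def build_launch_template_data(root_volume_gib: int = 100,
                               device_name: str = "/dev/xvda") -> dict:
    """Build EC2 LaunchTemplateData that enlarges the root EBS volume.

    Assumption: the AMI's root device is /dev/xvda (true for the
    ECS-optimized Amazon Linux 2 AMI; check your own AMI).
    """
    if root_volume_gib <= 0:
        raise ValueError("root_volume_gib must be positive")
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": device_name,
                "Ebs": {
                    "VolumeSize": root_volume_gib,  # size in GiB
                    "VolumeType": "gp3",
                    "DeleteOnTermination": True,
                },
            }
        ]
    }


def create_launch_template(name: str, root_volume_gib: int = 100) -> str:
    """Create the launch template in EC2 and return its ID."""
    # Imported lazily so the builder above can be used without boto3.
    import boto3

    ec2 = boto3.client("ec2")
    response = ec2.create_launch_template(
        LaunchTemplateName=name,  # hypothetical name chosen by the caller
        LaunchTemplateData=build_launch_template_data(root_volume_gib),
    )
    return response["LaunchTemplate"]["LaunchTemplateId"]
```

The returned ID would then be referenced from the compute environment's computeResources, for example as launchTemplate = {"launchTemplateId": "<id>", "version": "$Latest"}. Instances that are already running keep their original volume size, so the larger disk only applies to instances launched with the template.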

AWS
Rachel
answered a year ago
