Amazon Base AMI stopped working

Question

Hello,  
  
We had a working setup, deep learning containers on AWS Batch, which unexpectedly stopped working. Our stack consisted of:  
  
- Compute environments with p3.2xlarge and g4.2xlarge instances  
- Amazon ECS-optimized GPU AMI (late 2020 and 2021 versions)  
- Docker base image: nvcr.io/nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04  
  
Our container runs a simple nvidia-smi check. This setup worked throughout February, but yesterday jobs that we ran on P3 instances started failing. The same jobs run on G4 instances work as expected. Since nvidia-smi fails, we suspect some sort of CUDA or NVIDIA driver mismatch. However, neither the AMI IDs nor the base Docker image versions changed between our tests.  
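
For readers who want to reproduce this kind of check, here is a minimal sketch of an nvidia-smi probe in Python. It is not the poster's actual check; the function name `gpu_visible` and its parameters are hypothetical, and it only assumes that `nvidia-smi` is expected on the container's PATH.

```python
import shutil
import subprocess


def gpu_visible(binary="nvidia-smi", timeout=30):
    """Return (ok, detail): whether the driver query ran and exited with 0.

    `gpu_visible` is a hypothetical helper, not part of any AWS or NVIDIA API.
    """
    path = shutil.which(binary)
    if path is None:
        # The binary is missing, so the NVIDIA runtime was probably not mounted.
        return False, f"{binary} not found on PATH"
    try:
        result = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"{binary} could not be run: {exc}"
    if result.returncode != 0:
        # nvidia-smi reports driver and device problems via a non-zero exit code.
        return False, (result.stderr or result.stdout).strip()
    return True, result.stdout.strip()
```

Calling `gpu_visible()` as the first step of the job and exiting non-zero when it returns `False` makes a missing GPU show up as a failed job rather than a silent CPU fallback.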
  
The P3 instance itself has the following NVIDIA configuration:  
NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0  
  
We tried the updated AMIs but the issue persists. Any ideas what might be causing these issues?

Answer

@thepers Yes. We were able to deduce that the Batch Console for submitting jobs is buggy!  
  
Even though we requested a GPU in the job definition AND in the job submission form, the job instance itself in the Batch Console displays a GPU value of "--".  
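
One way to check what Batch actually recorded, independent of what the console displays, is to read the job's resource requirements from a `describe_jobs` response. This is only a sketch: `requested_gpus` is a hypothetical helper, and it assumes the response has the usual `jobs[].container.resourceRequirements` shape returned by boto3.

```python
def requested_gpus(describe_jobs_response):
    """Map each job ID to the GPU count Batch recorded, or None if absent.

    Hypothetical helper; expects the dict returned by
    boto3.client("batch").describe_jobs(jobs=[...]).
    """
    result = {}
    for job in describe_jobs_response.get("jobs", []):
        requirements = job.get("container", {}).get("resourceRequirements", [])
        gpu = next(
            (r["value"] for r in requirements if r.get("type") == "GPU"), None
        )
        result[job["jobId"]] = gpu
    return result
```

A `None` for a job that was submitted with a GPU would be consistent with the request having been dropped before it reached the container.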
  
Try submitting the job with the AWS CLI command "aws batch submit-job" instead; it worked for us. Please confirm later whether this is an issue AWS should be fixing.  
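
If you would rather script the submission, the same SubmitJob API is available through boto3. The sketch below is not what we ran (we used the CLI); the helper names and all argument values are hypothetical placeholders, and it assumes the GPU is requested explicitly through `containerOverrides.resourceRequirements`.

```python
def build_submit_job_request(job_name, job_queue, job_definition, gpus=1):
    """Build keyword arguments for the Batch SubmitJob call.

    Hypothetical helper; the GPU count is passed explicitly as a string,
    which is the form the API expects for resource requirement values.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "resourceRequirements": [{"type": "GPU", "value": str(gpus)}],
        },
    }


def submit(request):
    """Send the request to AWS Batch and return the new job ID."""
    import boto3  # imported lazily so building the request needs no AWS SDK

    return boto3.client("batch").submit_job(**request)["jobId"]
```

For example, `submit(build_submit_job_request("gpu-check", "my-queue", "my-job-def"))` would submit one job requesting a single GPU, where the queue and job definition names are placeholders for your own.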
  
Edited by: mihaj on Feb 24, 2021 10:39 PM

Answer

Hi,  
  
Have you been able to solve this issue?   
  
I'm experiencing a similar error in a compute environment with p2 instances. I'm trying to run a TensorFlow training job that previously ran on the GPU, but now it cannot find the GPU from inside the Docker container. By manually entering the container, I verified that nvidia-smi fails there, but it works if I manually create a container with the GPU mounted, from the same Docker image and on the same instance. It seems that the container created by the Batch job no longer gets access to the GPU, even though I specify 1 GPU in the job definition.   
  
My compute environment uses ami-0c5652efc27d66bb3 with instance type p2.xlarge. I have also tried earlier versions of the AMI, but that did not solve the problem.
