Hi,
Have you been able to solve this issue?
I'm experiencing a similar error with a compute environment that uses p2 instances. I'm trying to run a TensorFlow training job that has previously run on a GPU, but now it cannot find the GPU from inside the Docker container. By entering the container manually I have also verified that nvidia-smi fails from within it, yet it works if I manually create a container with the GPU mounted, using the same Docker image on the same instance. It seems the container created by the Batch job no longer gets access to the GPU, even though I specify 1 GPU in the job definition.
My compute environment uses ami-0c5652efc27d66bb3 with instance type p2.xlarge. I have also tried earlier versions of the AMI, but that did not solve the problem. For reference, the GPU request in my job definition is roughly along the lines of the sketch below.
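Rough sketch of how the GPU is requested (image name and resource values here are placeholders, not my exact setup):

aws batch register-job-definition \
  --job-definition-name tf-gpu-training \
  --type container \
  --container-properties '{
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/tf-training:latest",
    "resourceRequirements": [
      {"type": "VCPU",   "value": "4"},
      {"type": "MEMORY", "value": "30720"},
      {"type": "GPU",    "value": "1"}
    ]
  }'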
@thepers Yes. We were able to deduce that the Batch console's job-submission form is buggy!
Even though we requested a GPU in the job definition AND in the job submission form, the job itself shows a GPU value of "--" in the Batch console.
Try the AWS CLI command "aws batch submit-job" instead; it worked for us (a sketch is below). Please confirm later whether this is an issue AWS should be fixing.
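Something along these lines (job name, queue, and job definition name are placeholders for your own):

aws batch submit-job \
  --job-name tf-gpu-run \
  --job-queue my-gpu-queue \
  --job-definition tf-gpu-training \
  --container-overrides '{"resourceRequirements": [{"type": "GPU", "value": "1"}]}'

The explicit GPU entry in --container-overrides mirrors what the console form was supposed to send, and in our case the job then landed on the instance with the GPU actually attached to the container.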
Edited by: mihaj on Feb 24, 2021 10:39 PM