What is a good AMI for GPU-based TensorFlow work?


Hi, using an EC2 instance (p3.8xlarge, 4 V100 GPUs), I cannot find an AMI that actually makes use of these GPUs with tensorflow [-gpu].

I've tried:

  • NVIDIA GPU-Optimized AMI - has issues. Very little is installed, but that's OK since it is Ubuntu and you can install whatever you need. But, with tensorflow-gpu installed (a fuller version of this check is sketched after this list):

        print(tf.__version__)
        print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
        2.9.1
        Num GPUs Available: 0

Other AMIs see the GPUs but are unable to utilize them (using nvidia-smi to monitor usage). For this, I've tried:

  • Deep Learning AMI GPU TensorFlow 2.9.1 (Amazon Linux 2) 20220803
  • Deep Learning AMI GPU TensorFlow 2.9.1 (Ubuntu 20.04) 20220803
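For reference, a slightly fuller version of that check (a minimal sketch; it assumes only that a tensorflow / tensorflow-gpu wheel is pip-installed, nothing AMI-specific). tf.test.is_built_with_cuda() helps tell a CPU-only wheel apart from a CUDA/driver mismatch:

    import tensorflow as tf

    # TensorFlow version and whether this wheel was compiled against CUDA at all
    print(tf.__version__)
    print("Built with CUDA:", tf.test.is_built_with_cuda())

    # GPUs that TensorFlow can actually see; an empty list usually means a
    # CPU-only wheel, or a CUDA/cuDNN version that doesn't match the driver
    gpus = tf.config.list_physical_devices('GPU')
    print("Num GPUs Available:", len(gpus))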

Has anyone successfully deployed GPUs on a DLAMI? This time last year I was able to use individual GPUs OK (distributed training is another thing), but this year none are working.
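For what it's worth, a quick way to see where ops actually land (beyond watching nvidia-smi) is device-placement logging. A minimal sketch, again nothing AMI-specific:

    import tensorflow as tf

    # Log the device each op is placed on (/device:GPU:0, /device:CPU:0, ...)
    tf.debugging.set_log_device_placement(True)

    # A small matmul; if TensorFlow can drive the GPU, the log should show a
    # GPU placement and nvidia-smi should briefly show memory/utilization
    a = tf.random.normal([4096, 4096])
    b = tf.random.normal([4096, 4096])
    c = tf.matmul(a, b)
    print(c.shape)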

Thanks,

Example nvidia-smi output:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
    | N/A   40C    P0    35W / 300W |      3MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
    | N/A   39C    P0    35W / 300W |      3MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
    | N/A   38C    P0    40W / 300W |      3MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   38C    P0    38W / 300W |      3MiB / 16384MiB |      0%      Default |

Asked 2 years ago · 825 views
1 Answer

Hello,

The size of your model should be a factor in selecting an instance. If your model exceeds an instance's available RAM, select a different instance type with enough memory for your application.
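As a rough rule of thumb (a sketch only: optimizer state, activations, and batch size add considerably on top of the raw weights), you can estimate a Keras model's parameter memory like this; the model here is just a hypothetical example:

    import tensorflow as tf

    # Hypothetical small model, purely for illustration
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation='relu', input_shape=(512,)),
        tf.keras.layers.Dense(10),
    ])

    # float32 parameters take 4 bytes each; optimizer state and activations
    # (which scale with batch size) come on top of this figure
    params = model.count_params()
    print(f"{params} parameters ~ {params * 4 / 1e6:.1f} MB of weights")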

Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.

Amazon EC2 P4 Instances have up to 8 NVIDIA Tesla A100 GPUs.

Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.

Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.

Amazon EC2 G5 Instances have up to 8 NVIDIA A10G GPUs.

Amazon EC2 G5g Instances have Arm-based AWS Graviton2 processors with NVIDIA T4G Tensor Core GPUs.

Refer to this link for more details on the selection of instances: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html

Hope this helps. If it does, please mark this as the accepted answer.

AWS
Support Engineer
Answered 2 years ago
  • Yes, the size of the model is one thing - mine is pretty large (5 MB), but the data set can be thousands of times larger than that, so holding batches in GPU memory is critical. Hence I chose p3.8xlarge, which is the smallest I can get away with. But the question wasn't about that; it was about which AMI to choose. I listed three that were no use and was asking the community what they have found to work well with TensorFlow: good utilization of the GPUs and, potentially, distributed training (roughly the kind of setup sketched below).
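To make that concrete, this is roughly the multi-GPU setup I have in mind (a minimal sketch: it assumes a working GPU-enabled TensorFlow install, and the model and dataset here are just placeholders for mine):

    import tensorflow as tf

    # Data-parallel training across all GPUs visible on the instance
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    # Placeholder dataset; in practice batches stream from disk/S3, so the
    # full data set never has to sit in GPU memory at once
    x = tf.random.normal([1024, 512])
    y = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(64).prefetch(tf.data.AUTOTUNE)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(1024, activation='relu', input_shape=(512,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])

    model.fit(ds, epochs=2)

MirroredStrategy splits each batch across the visible GPUs, and the prefetch keeps batches streaming, which is the part I care about given the data set size.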
