What is a good AMI for GPU-based tensorflow work?


Hi. Using a p3.8xlarge EC2 instance (4 V100 GPUs), I cannot find an AMI that makes use of them with tensorflow[-gpu].

I've tried:

  • NVIDIA GPU-Optimized AMI - very little installed, but that's OK as it is Ubuntu and you can install whatever you need. However, with tensorflow-gpu installed:

        print(tf.__version__)
        print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

    outputs:

        2.9.1
        Num GPUs Available: 0

Other AMIs see the GPUs, but are unable to utilize them (using nvidia-smi to monitor usage). For this, I've tried:

  • Deep Learning AMI GPU TensorFlow 2.9.1 (Amazon Linux 2) 20220803
  • Deep Learning AMI GPU TensorFlow 2.9.1 (Ubuntu 20.04) 20220803
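In case it helps others reproduce the problem: a common cause of TensorFlow reporting 0 GPUs is that it cannot dlopen the CUDA/cuDNN shared libraries at all. Below is a minimal, hedged sketch for checking whether those libraries are even discoverable; the search directories are assumptions about common install locations, not guarantees about any particular AMI.

```python
import glob
import os

def find_cuda_libs(extra_dirs=()):
    """Return paths of CUDA/cuDNN shared libraries visible to the process.

    TensorFlow typically reports 0 GPUs when it cannot load libcudart /
    libcudnn, so listing these paths is a quick first diagnostic.
    """
    search_dirs = list(extra_dirs)
    search_dirs += os.environ.get("LD_LIBRARY_PATH", "").split(":")
    # Assumed common install locations; adjust for your AMI.
    search_dirs += ["/usr/local/cuda/lib64", "/usr/lib/x86_64-linux-gnu"]
    found = []
    for d in search_dirs:
        if not d:
            continue
        for pattern in ("libcudart.so*", "libcudnn.so*"):
            found.extend(glob.glob(os.path.join(d, pattern)))
    return sorted(set(found))

if __name__ == "__main__":
    libs = find_cuda_libs()
    print("\n".join(libs) if libs else "No CUDA/cuDNN libraries found")
```

If this prints nothing, the issue is library visibility (driver/toolkit install or LD_LIBRARY_PATH) rather than TensorFlow itself.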

Has anyone successfully deployed GPUs on a DLAMI? This time last year I was able to use individual GPUs OK (distributed training is another matter), but this year none are working.


Example nvidia-smi output:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On  | 00000000:00:1B.0 Off |                    0 |
    | N/A   40C    P0    35W / 300W|      3MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On  | 00000000:00:1C.0 Off |                    0 |
    | N/A   39C    P0    35W / 300W|      3MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On  | 00000000:00:1D.0 Off |                    0 |
    | N/A   38C    P0    40W / 300W|      3MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On  | 00000000:00:1E.0 Off |                    0 |
    | N/A   38C    P0    38W / 300W|      3MiB / 16384MiB |      0%      Default |
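For scripted monitoring, nvidia-smi's query mode is easier to parse than the human-readable table above. A small sketch, assuming the standard `--query-gpu` CSV output format (index, utilization, memory used):

```python
import csv
import io
import subprocess

def gpu_utilization(raw=None):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    Returns a list of (index, utilization %, memory-used MiB) tuples.
    If `raw` is None, nvidia-smi is invoked; pass a string to parse
    canned output instead.
    """
    if raw is None:
        raw = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True)
    rows = csv.reader(io.StringIO(raw))
    return [(int(i), int(u), int(m)) for i, u, m in rows]
```

Run during training, a steady 0% utilization on all four indices confirms the symptom described above: the GPUs are visible but never used.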

asked 2 years ago · 937 views
1 Answer


The size of your model should be a factor in selecting an instance. If your model exceeds an instance's available RAM, select a different instance type with enough memory for your application.
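As a rough sanity check on memory sizing, the footprint of one input batch can be estimated with simple arithmetic. The shapes below are illustrative assumptions, not the asker's actual data:

```python
def batch_memory_mib(samples_per_batch, floats_per_sample, bytes_per_float=4):
    """Rough lower bound on GPU memory needed for one batch of float32 inputs.

    Activations, gradients, and optimizer state add a large multiple on top
    of this, so treat the result as a floor, not total usage.
    """
    return samples_per_batch * floats_per_sample * bytes_per_float / 2**20

# Example: a batch of 256 images of 224*224*3 float32 values
# occupies 256 * 224 * 224 * 3 * 4 bytes = 147 MiB before any
# activations are allocated.
```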

Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.

Amazon EC2 P4 Instances have up to 8 NVIDIA A100 GPUs.

Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.

Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.

Amazon EC2 G5 Instances have up to 8 NVIDIA A10G GPUs.

Amazon EC2 G5g Instances have Arm-based AWS Graviton2 processors and up to 2 NVIDIA T4G GPUs.

Refer to this link for more details on the selection of instances: https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
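The instance families above can be summarized as a small lookup table. This is only a sketch built from the list in this answer (the counts are per-family maxima, so e.g. a p3.8xlarge has 4 of its family's 8 possible GPUs), and the helper name is illustrative:

```python
# GPU model and maximum count per EC2 family, per the list above.
EC2_GPU_FAMILIES = {
    "p3": ("NVIDIA V100", 8),
    "p4": ("NVIDIA A100", 8),
    "g3": ("NVIDIA M60", 4),
    "g4": ("NVIDIA T4", 4),
    "g5": ("NVIDIA A10G", 8),
}

def gpus_for_instance(instance_type):
    """Return (gpu model, family max GPUs) for a type like 'p3.8xlarge',
    or None for non-GPU families."""
    family = instance_type.split(".")[0]
    return EC2_GPU_FAMILIES.get(family)
```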

Hope this helps. If it does, please click "Accept answer".

AWS
answered 2 years ago
  • Yes, the size of the model is one factor. Mine is pretty large (5 MByte), but the data set can be thousands of times larger than that, so holding batches in GPU memory is critical. Hence I chose p3.8xlarge, the smallest I can get away with. But the question wasn't about instance type; it was about which AMI to choose. I listed three that were no use and was asking the community what they have found to work well with TensorFlow: good utilization of GPUs and, potentially, distributed training.
