DL AMI GPU TensorFlow 2.9.1 (Ubuntu 20.04) on p3.8xlarge not using GPUs


This AMI comes with tensorflow, but I had to add tensorflow-gpu. When I test what we have:

    print(tf.__version__)
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

I get:

    2.9.1
    Num GPUs Available: 4

And running the canonical examples shows a GPU being used:

    c = tf.matmul(a, b)

    Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
    Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
    Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0

My code is set up for distributed training (and I've also tested without it), but the result is the same: nvidia-smi shows basically no GPU usage. It's also very slow (training on 50,000 images for semantic segmentation):

    |   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
    | N/A   39C    P0    49W / 300W |  15238MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
    | N/A   40C    P0    49W / 300W |  15112MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
    | N/A   39C    P0    52W / 300W |  15112MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   40C    P0    54W / 300W |  15112MiB / 16384MiB |      0%      Default |

Does anyone have any idea what I should look at? Or - and I have looked - are there examples out there of this working?

GPU-Util at 0% is not right. On my workstation with a Quadro GV100, the same code drives the GPU to 97% utilization with about 30 GB of memory in use.
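A single nvidia-smi snapshot can miss short bursts of activity, so it may help to sample utilization repeatedly while training runs. The sketch below is one possible way to do that: it calls nvidia-smi with its CSV query options and parses the result. The helper names `parse_gpu_stats` and `read_gpu_stats` are made up for this illustration, and the parser assumes every field is a plain integer (a field reported as `[N/A]` would need extra handling).

```python
import subprocess
from typing import List, Tuple

# nvidia-smi CSV query: one line per GPU, e.g. "0, 0, 15238"
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[Tuple[int, int, int]]:
    """Turn nvidia-smi CSV output into (gpu_index, util_percent, memory_mib) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append((int(index), int(util), int(mem)))
    return stats


def read_gpu_stats() -> List[Tuple[int, int, int]]:
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` every second or so during training gives a utilization trace. By default TensorFlow reserves most of each visible GPU's memory at startup, so high memory use alongside 0% utilization does not by itself show that the GPUs are doing work.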


mirrored_strategy = tf.distribute.MirroredStrategy()
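For comparison, here is a minimal sketch of how `tf.distribute.MirroredStrategy` is commonly wired into a Keras training loop. It is not the asker's code: the tiny model, the random data, the image size, and the names `global_batch_size` and `train_demo` are all placeholders invented for illustration. The two points it tries to show are that the model is built and compiled inside `strategy.scope()`, and that the dataset is batched with the global batch size (per-replica size times the number of replicas).

```python
def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Batch size for the dataset so that each replica sees per_replica_batch items."""
    if per_replica_batch <= 0 or num_replicas <= 0:
        raise ValueError("batch size and replica count must be positive")
    return per_replica_batch * num_replicas


def train_demo(per_replica_batch: int = 8, steps: int = 20) -> None:
    # Imported here so the helper above is usable without TensorFlow installed.
    import numpy as np
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
    batch = global_batch_size(per_replica_batch, strategy.num_replicas_in_sync)

    # Hypothetical random data standing in for real images and segmentation masks.
    images = np.random.rand(batch * steps, 64, 64, 3).astype("float32")
    masks = np.random.randint(0, 2, size=(batch * steps, 64, 64, 1)).astype("float32")
    dataset = (
        tf.data.Dataset.from_tensor_slices((images, masks))
        .batch(batch)
        .prefetch(tf.data.AUTOTUNE)  # overlap input preparation with training
    )

    # Variables must be created inside the scope to be mirrored across GPUs.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                                   input_shape=(64, 64, 3)),
            tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")

    model.fit(dataset, epochs=1)
```

Running `train_demo()` on the instance while watching nvidia-smi is one way to check whether the GPUs become busy with a trivially simple input pipeline. If they do, the real data loading path may be worth a closer look.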

asked 4 months ago, 45 views
1 Answer


Thank you for using AWS DLAMI.

From your question, I understand that on the DLAMI with TensorFlow 2.9.1, GPU utilization stays at zero and training appears to run on the CPU, even after you installed tensorflow-gpu explicitly.

From the scenario you describe, my assumption is that although tensorflow-gpu was installed, your code ends up using the CPU build of TensorFlow. With limited visibility into your setup and code, I can't be certain of that. To investigate in more depth, I recommend contacting AWS Premium Support by creating a support case so that an engineer can look into the root cause.
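One way to test that assumption yourself is to list which TensorFlow distributions are installed in the active Python environment. The sketch below uses only the standard library; the function name `installed_versions` and the list of candidate package names are choices made for this example, not anything specific to the DLAMI.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution names that commonly provide the `tensorflow` module.
CANDIDATES = ("tensorflow", "tensorflow-gpu", "tensorflow-cpu")


def installed_versions(names: Iterable[str] = CANDIDATES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If more than one of these shows a version, the environment has overlapping packages, and it may be worth checking which one the import actually resolves to. In TensorFlow 2.x the `tensorflow` package itself includes GPU support, so a separately installed tensorflow-gpu is usually not needed.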

Before that, though, I'd strongly recommend using a Deep Learning Container on SageMaker instead of the DLAMI on EC2. That also lets you use the SageMaker distributed training libraries, which are available only through the AWS Deep Learning Containers for TensorFlow. SageMaker additionally supports bringing your own container, so you can build a container that matches the requirements of your use case.

If moving to SageMaker does not resolve the issue either, please reach out to AWS Premium Support as mentioned above so that we can assist you further.


[1] TensorFlow with Horovod - https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html

[2] Amazon SageMaker Distributed Training Libraries - https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

[3] https://github.com/aws/sagemaker-tensorflow-extensions

[4] Use TensorFlow with Amazon SageMaker - https://docs.aws.amazon.com/sagemaker/latest/dg/tf.html

[5] Open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create

answered 4 months ago
