DL AMI GPU TensorFlow 2.9.1 (Ubuntu 20.04) on p3.8xlarge not using GPUs

0

This AMI comes with tensorflow, but I had to add tensorflow-gpu. When I test what we have:

```python
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
```

I get:

```
2.9.1
Num GPUs Available:  4
```

And running the canonical examples shows a GPU being used:

```
c = tf.matmul(a, b)
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
```

My code is implemented to do distributed training (and I've tested without that too), but the result is the same: nvidia-smi shows basically no GPU usage. It's also very slow (training on 50,000 images for semantic segmentation):

```
+-------------------------------+----------------------+----------------------+
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   39C    P0    49W / 300W |  15238MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   40C    P0    49W / 300W |  15112MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P0    52W / 300W |  15112MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    54W / 300W |  15112MiB / 16384MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

Does anyone have any idea what I should look at? Or - and I have looked - are there examples out there of this working?

GPU-Util at 0% is not right. On my workstation with a Quadro GV100 I run it at 97% and 30 GB of memory usage.
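One check that seems relevant here (a sketch; these are standard TensorFlow APIs, but the output shown depends on which wheel is actually active) is to confirm that the TensorFlow build being imported was compiled against CUDA at all, and to log where each op is placed. If `is_built_with_cuda()` returns False, a CPU-only `tensorflow` wheel is shadowing `tensorflow-gpu`:

```python
import tensorflow as tf

# False here means the active wheel is a CPU-only build, even if
# tensorflow-gpu is also installed in the same environment.
print("Built with CUDA:", tf.test.is_built_with_cuda())

# On a CUDA build this dict includes the CUDA/cuDNN versions the
# wheel was compiled against; on a CPU build the key is absent.
print("CUDA version:", tf.sysconfig.get_build_info().get("cuda_version"))

# Log the device placement of every op; training ops landing on
# /device:CPU:0 would explain 0% GPU-Util with memory still allocated.
tf.debugging.set_log_device_placement(True)
```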

https://www.tensorflow.org/guide/distributed_training

mirrored_strategy = tf.distribute.MirroredStrategy()
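For context, this is the pattern I'm following from that guide (simplified; the model and batch size below are placeholders, not my actual segmentation network):

```python
import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", mirrored_strategy.num_replicas_in_sync)

# Model creation and compilation must happen inside strategy.scope()
# so variables are mirrored across all GPUs.
with mirrored_strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=(128, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# The global batch size should be scaled by the replica count, or
# each GPU receives too little work per step to show utilization.
GLOBAL_BATCH_SIZE = 64 * mirrored_strategy.num_replicas_in_sync
```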

Asked 2 years ago · 418 views

1 Answer
0

Hello,

Thank you for using AWS DLAMI.

Looking at the above query, I understand that while using the DLAMI for TensorFlow 2.9.1, GPU utilization was observed to be zero and the workload was running on the CPU after tensorflow-gpu was added explicitly.

From the scenario shared, I assume that tensorflow-gpu was added but the CPU build of TensorFlow is actually being used; however, with limited visibility into your setup and code, I cannot be sure enough to claim that. To understand the issue in more depth, I'd recommend reaching out to AWS Premium Support by creating a support case so that an engineer can investigate the root cause.
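As a quick check before opening a case, you could verify which TensorFlow wheels are installed side by side and whether the active build links against CUDA (a sketch; adjust `pip`/`python` to the interpreter your training job uses):

```shell
# Having both tensorflow and tensorflow-gpu installed in the same
# environment can leave a CPU-only build active.
pip list 2>/dev/null | grep -i tensorflow

# Confirm the interpreter's active build was compiled with CUDA.
python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda())"
```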

But prior to that, I'd highly recommend using a Deep Learning Container on SageMaker instead of the DLAMI on EC2. Along with that, you can use the SageMaker distributed training libraries, which are available only through the AWS Deep Learning Containers for TensorFlow. Additionally, SageMaker provides Bring Your Own Container, where users can build their own container to fit the requirements of their use case.

If using SageMaker does not help overcome the issue either, then as mentioned above, please reach out to AWS Premium Support so that we can better assist you.

Reference:

[1] TensorFlow with Horovod - https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-horovod-tensorflow.html

[2] Amazon SageMaker Distributed Training Libraries - https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

[3] https://github.com/aws/sagemaker-tensorflow-extensions

[4] Use TensorFlow with Amazon SageMaker - https://docs.aws.amazon.com/sagemaker/latest/dg/tf.html

[5] Open a support case with AWS using the link: https://console.aws.amazon.com/support/home?#/case/create

AWS
Support Engineer
Answered 2 years ago
