CuDNN Library Not Working Out-of-the-Box


I am using the g4dn.xlarge instance type with the Deep Learning AMI GPU TensorFlow 2.10.0 (Amazon Linux 2) 20220927.

Upon logging in for the first time, I test the installation and get:

[ec2-user@ip-10-0-0-133 ~]$ /usr/local/bin/python3.9 -c "import tensorflow"
2022-10-06 07:09:39.691571: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-06 07:09:39.815546: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-10-06 07:09:39.848413: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-10-06 07:09:40.583444: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib:/usr/lib:/lib:
2022-10-06 07:09:40.583574: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/lib:/usr/lib:/lib:
2022-10-06 07:09:40.583594: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
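For anyone reading the output above, it can help to pull out exactly which libraries failed to load. The following is only a minimal sketch, assuming the log text has been captured into a string; the function name `missing_libraries` and the shortened sample lines are illustrative, not part of TensorFlow. In the output above, the libraries named in the warnings are `libnvinfer` ones, which belong to TensorRT rather than cuDNN.

```python
import re

# Matches the dso_loader warning TensorFlow prints when a shared library cannot be opened.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names that failed to load, in order, without duplicates."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    # Two shortened lines in the same shape as the warnings above (details elided).
    sample = (
        "W tensorflow/stream_executor/platform/default/dso_loader.cc:64] "
        "Could not load dynamic library 'libnvinfer.so.7'; dlerror: ...\n"
        "W tensorflow/stream_executor/platform/default/dso_loader.cc:64] "
        "Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: ...\n"
    )
    for lib in missing_libraries(sample):
        print(lib)
```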

The error was worrying, and so I decided to check the status of CUDA and cuDNN.

[ec2-user@ip-10-0-0-133 ~]$ whereis nvcc
nvcc: /usr/local/cuda-11.2/bin/nvcc.profile /usr/local/cuda-11.2/bin/nvcc
[ec2-user@ip-10-0-0-133 ~]$ whereis cudnn.h
cudnn:

The lack of a path for cuDNN is of course a problem. To confirm, I ran the scripts from this answer (https://stackoverflow.com/a/47436840) and got the following outputs:

libcudart.so.11.0 -> libcudart.so.11.2.152
libcuda.so.1 -> libcuda.so.510.47.03
libcuda.so.1 -> libcuda.so.510.47.03
libcuda is installed
libcudart.so.11.0 -> libcudart.so.11.2.152
libcudart is installed

and

ERROR: libcudnn is NOT installed

However, when navigating to the CUDA folder, I see that the cuDNN files are actually present, so I'm unsure what the problem is.
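One way to cross-check "files are present" against "library is not found" is to look for cuDNN in the same directories the dynamic loader searches. This is a rough sketch using only the Python standard library; it assumes the relevant directories are the ones listed in LD_LIBRARY_PATH, and the helper names `find_cudnn` and `can_dlopen` are made up for illustration.

```python
import ctypes
import glob
import os


def find_cudnn(search_path):
    """Return sorted libcudnn shared-object paths found in a colon-separated directory list."""
    hits = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        hits.extend(glob.glob(os.path.join(directory, "libcudnn*.so*")))
    return sorted(set(hits))


def can_dlopen(path):
    """Try to load a shared library through the dynamic loader; True on success."""
    try:
        ctypes.CDLL(path)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    found = find_cudnn(os.environ.get("LD_LIBRARY_PATH", ""))
    if not found:
        print("no libcudnn files on LD_LIBRARY_PATH")
    for path in found:
        print(path, "loads" if can_dlopen(path) else "present but fails to load")
```

If files are listed but fail to load, the problem is more likely a broken symlink or a version mismatch than a missing install; if nothing is listed, the files live in a directory the loader is not searching.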

PSW
asked 2 years ago, 2726 views
2 Answers

Hello there,

I understand that you launched the AWS Deep Learning AMI GPU TensorFlow 2.10.0 (Amazon Linux 2) 20220927 with instance type g4dn.xlarge. This was done successfully. Upon ssh-ing into your instance you ran the following in the terminal:

$ /usr/local/bin/python3.9 -c "import tensorflow"

This was to verify that TensorFlow is installed on this type of instance. However, doing so resulted in the errors shown in your re:Post question. Please let me know if I have misunderstood anything.

I did some research and found that this is a common issue with TensorFlow 2.10 [1]. Starting with TensorFlow 2.10, Linux CPU builds for Aarch64/ARM64 processors are built and maintained by AWS [2]. Installing TensorFlow on these machines installs tensorflow-cpu-aws by default [3], which is already installed on your instance. TensorFlow also fails to load on older GPUs when CUDA_FORCE_PTX_JIT=1 is set [4], which may be the case for this type of instance. The cuBLAS plugin error and the warnings you got when importing TensorFlow do not seem to prevent you from using TensorFlow [1].

As a workaround, you can install TensorFlow 2.9.2 in a conda environment:

  • Install Miniconda
$ curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
  • Create a conda environment and activate it
$ conda create --name tf_env python=3.9
$ conda activate tf_env
  • Install TensorFlow
$ pip install tensorflow==2.9.2
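After the steps above, a short check along these lines can show whether the installed build was compiled with CUDA and whether it can see a GPU. It is a sketch, not an official diagnostic: the function name `tensorflow_gpu_report` is made up here, and it relies on `tf.sysconfig.get_build_info()` and `tf.config.list_physical_devices()`, which are available in recent TensorFlow 2.x releases. It returns a placeholder if TensorFlow is not importable.

```python
def tensorflow_gpu_report():
    """Summarize the installed TensorFlow build and the GPUs it can see."""
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow": None}  # TensorFlow is not installed in this environment
    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        # CPU-only builds may not carry the CUDA keys, hence .get()
        "cuda_build": build.get("is_cuda_build"),
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in tensorflow_gpu_report().items():
        print(f"{key}: {value}")
```

An empty `gpus` list on a GPU instance means TensorFlow imported but fell back to the CPU, which is different from the import itself failing.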

I hope you found the provided information helpful and thank you again for reaching out to Premium Support. Should you have any further questions or require any additional assistance, please feel free to reach out and I will be more than happy to assist.

Resources

  1. https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
  2. https://blog.tensorflow.org/2022/09/announcing-tensorflow-official-build-collaborators.html
  3. https://pypi.org/project/tensorflow-cpu-aws/
  4. https://github.com/tensorflow/tensorflow/issues/57679
  5. https://www.tensorflow.org/install/pip
  6. https://docs.conda.io/en/latest/miniconda.html
  7. https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html
AWS
answered 2 years ago
  • Hi Cebz,

    Your suggestion gets me a working version of TensorFlow 2.9.2. In fact, there is no need to use conda at all, since simply downgrading to TensorFlow 2.9.2 works. But I do not want to use TensorFlow 2.9.2; if I did, I would have selected one of the available TensorFlow 2.9.2 AMIs. For my task, I require some of the features released in 2.10.0. I have attempted the conda solution with TensorFlow 2.10.0, and it did not work.

    You are correct that the error does not prevent me from using TensorFlow, but it does prevent me from using the GPU with TensorFlow. If I do not use a GPU for my task, training the model would take weeks. GPU usage is not optional.

    I have attempted to use TensorFlow 2.10.0 on the AWS Deep Learning AMI GPU TensorFlow 2.10.0 with the following instance types: g4dn.xlarge (NVIDIA T4 GPU), g5.xlarge (NVIDIA A10G Tensor Core GPU), and p3.2xlarge (NVIDIA Tesla V100 GPU). None of these are "older GPUs", especially not the V100.

    All encountered the exact same issue. I am stunned that the GPU-accelerated computing instances advertised for ML cannot use their accelerators for ML tasks with the latest Amazon ML AMI.

    I would appreciate further assistance in this matter.


Hi

Thank you for reaching out to us.

I understand that you are using the g4dn.xlarge instance type with the Deep Learning AMI GPU TensorFlow 2.10.0 (Amazon Linux 2), are having issues using the cuDNN library, and would like to confirm whether it is part of the DLAMI.

To investigate further, please confirm the AMI ID by querying it with the AWS CLI (the example region is us-east-1):

$ aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning AMI GPU TensorFlow 2.10.? (Amazon Linux 2) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
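To make the `--query` expression in the command above easier to read: it sorts the returned images by creation date, reverses the order so the newest comes first, keeps the first entry, and extracts its image ID. The sketch below mirrors that logic in plain Python on a list of dictionaries. The function name `newest_image_id` and the sample records are made up for illustration and are not real AMI IDs; it also assumes `CreationDate` is an ISO 8601 string, which sorts correctly as text.

```python
def newest_image_id(images):
    """Mirror reverse(sort_by(Images, &CreationDate))[:1].ImageId on a list of dicts."""
    oldest_first = sorted(images, key=lambda image: image["CreationDate"])
    newest_first = list(reversed(oldest_first))
    # The [:1] slice keeps at most one image, so the result is a list of zero or one IDs.
    return [image["ImageId"] for image in newest_first[:1]]


if __name__ == "__main__":
    # Hypothetical records shaped like the Images array that describe-images returns.
    sample_images = [
        {"ImageId": "ami-00000000000000001", "CreationDate": "2022-09-01T10:00:00.000Z"},
        {"ImageId": "ami-00000000000000002", "CreationDate": "2022-09-27T21:00:00.000Z"},
        {"ImageId": "ami-00000000000000003", "CreationDate": "2022-08-15T08:30:00.000Z"},
    ]
    print(newest_image_id(sample_images))
```

Comparing the ID this query returns with the AMI ID of the running instance shows whether the instance was launched from the latest release of the image.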

Please review the release notes for aws-deep-learning-ami-gpu-tensorflow-2-10-amazon-linux-2, confirm that you are using the latest version, and open an AWS support ticket if you need further guidance.

For security reasons, this post is not a suitable place to share customer resource details.

If you have other questions or need further clarification, please don't hesitate to open a support ticket with AWS Premium Support, and we would be glad to investigate the issue further.

Reference:

https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-tensorflow-2-10-amazon-linux-2/

AWS
answered 2 years ago
