Hello,
I tried using an Inf1 EC2 instance to deploy my ML model, and I need to monitor the GPU usage of the model. I could find the CPU usage in the AWS console, but not the GPU usage.
Already tried:
- https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring-gpumon.html
This didn't work; it threw the following error:
(python3) ubuntu@ip-xxx-mm-yy-zzz:~/tools/GPUCloudWatchMonitor$ python3 gpumon.py
Traceback (most recent call last):
  File "gpumon.py", line 146, in <module>
    nvmlInit()
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.8/site-packages/pynvml/nvml.py", line 1450, in nvmlInit
    nvmlInitWithFlags(0)
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.8/site-packages/pynvml/nvml.py", line 1440, in nvmlInitWithFlags
    _nvmlCheckReturn(ret)
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.8/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_DriverNotLoaded: Driver Not Loaded
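The failure can be reproduced without the rest of gpumon.py by calling NVML directly. Below is a minimal probe I put together (my own sketch, not part of gpumon); it assumes the same pynvml package the script uses, and it catches the errors instead of raising so it runs whether or not a driver is present:

```python
def check_nvml() -> str:
    """Try to initialize NVML and report the result as a string
    instead of raising, so the probe runs with or without an
    NVIDIA driver installed."""
    try:
        import pynvml  # third-party package used by gpumon.py
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return f"driver loaded, {count} device(s) visible"
    except ImportError:
        return "pynvml not installed"
    except Exception as exc:  # e.g. NVMLError_DriverNotLoaded
        return f"NVML init failed: {exc}"

if __name__ == "__main__":
    print(check_nvml())
```

On this Inf1 instance the probe reports the same "Driver Not Loaded" condition as the full script.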
Also, nvidia-smi didn't work:
(python3) ubuntu@ip-xxx-mm-yy-zzz:~/tools/GPUCloudWatchMonitor$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Kindly suggest how I can monitor GPU usage on this instance.
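As a fallback, I considered publishing a utilization number as a custom CloudWatch metric myself, roughly along these lines. This is only a sketch: the namespace and metric name are placeholders I made up, the utilization value would have to come from whatever monitoring tool actually works on this instance, and publishing requires boto3 plus IAM permission for cloudwatch:PutMetricData.

```python
from datetime import datetime, timezone


def build_utilization_metric(percent: float, instance_id: str) -> list:
    """Build a PutMetricData payload for one utilization sample.
    The metric name is a placeholder, not an AWS-defined metric."""
    return [{
        "MetricName": "AcceleratorUtilization",  # hypothetical name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": percent,
        "Unit": "Percent",
    }]


def publish(percent: float, instance_id: str) -> None:
    """Send one sample to CloudWatch. Imported lazily so the payload
    builder above can be used without boto3 installed."""
    import boto3  # needs AWS credentials and cloudwatch:PutMetricData
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Custom/Inference",  # placeholder namespace
        MetricData=build_utilization_metric(percent, instance_id),
    )
```

But this still leaves me without a working source for the utilization value itself, which is the part I'm stuck on.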