Questions tagged with GPU Development
SageMaker g4 and g5 instances do not have working NVIDIA drivers
I am a heavy user of g4 and g5 instances on SageMaker (notebook instances). Today, when I tried to use the same instances as always, running `nvidia-smi` returned: `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.` These are exactly the same instance types and workloads I have used before, and I see the same message when running natively on EC2 as well.
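For reference, these are the checks I'm running next - a rough diagnostic sketch assuming an Amazon Linux 2 base with DKMS-managed drivers (the module/package handling may differ on the actual notebook instance image):

```
# Is the GPU visible and is the NVIDIA kernel module loaded for the running kernel?
# (A kernel update without a driver rebuild is a common cause of this error.)
lspci | grep -i nvidia      # GPU should appear on the PCI bus
lsmod | grep nvidia         # is the nvidia module loaded?
uname -r                    # currently running kernel
sudo dkms status            # driver modules built per kernel, if DKMS manages them

# If the module is missing for the running kernel, rebuilding/reloading may help:
sudo dkms autoinstall || true
sudo modprobe nvidia
nvidia-smi
```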
Parallel or Split GPUs for Video Streaming
Hi, I'm looking for ways to use GPUs for video streaming in a parallel fashion (similar to distributed training). Within the same Region, I would like to leverage several kinds of GPUs, if possible, to optimize delivery of our video streaming software - think Amazon Luna. This project, https://github.com/jamesstringerparsec/Easy-GPU-PV, sounds like what we are aiming for, but it is Windows-only. Is there any way to do that on AWS? I appreciate your help!
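To make the goal concrete, here is the kind of process-level fan-out I mean by "parallel" - a sketch only (the instance type, inputs, and NVENC flags are placeholders, not our production pipeline); what I'm really after is a way to share or partition the GPUs themselves across streams, the way Easy-GPU-PV does on Windows:

```
# Sketch: one hardware-accelerated encode per GPU on a multi-GPU instance
# (e.g. g4dn.12xlarge with 4x T4). Inputs and RTMP targets are placeholders.
ffmpeg -y -i input0.mp4 -c:v h264_nvenc -gpu 0 -f flv rtmp://example.com/live/stream0 &
ffmpeg -y -i input1.mp4 -c:v h264_nvenc -gpu 1 -f flv rtmp://example.com/live/stream1 &
wait
```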
Upgrade nvidia-driver in the Amazon EKS-optimized AMI with NVIDIA GPU support
The current EKS-optimized Amazon Linux AMI ships nvidia-driver version 470. Unfortunately, our software requires version 510. Is there an official AMI with that version, or is there a way to upgrade the nvidia-driver in the AMI myself, perhaps via a custom bootstrap/user-data command?
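This is roughly the user-data route I had in mind - a sketch only, assuming the AMI's driver can be removed with the runfile uninstaller; the driver version, download URL, and cluster name below are illustrative and would need validating against the actual AMI:

```
#!/bin/bash
# EXAMPLE ONLY: swap the preinstalled 470 driver for a 510-series driver
# before the node joins the cluster. Exact removal steps depend on how the
# AMI installed the driver (runfile vs. package).
set -euxo pipefail

sudo systemctl stop kubelet || true
sudo /usr/bin/nvidia-uninstall --silent || true     # present if the driver came from a runfile

# Install a newer driver from NVIDIA's runfile (version/URL illustrative).
curl -fSL -o /tmp/driver.run \
  https://us.download.nvidia.com/tesla/510.47.03/NVIDIA-Linux-x86_64-510.47.03.run
sudo sh /tmp/driver.run --silent

nvidia-smi

# Continue with the normal EKS bootstrap (cluster name is a placeholder).
/etc/eks/bootstrap.sh my-cluster-name
```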
Expensive GPU EC2 instances consistently take a long time to terminate after power down
Hi, for example i-09458427c9b51b6f9 and i-07a81d23b834d67a3, which were both g3s.xlarge, were stuck in shutting-down for about 10 minutes after "power down" appeared in the serial console. Power down is the last thing the kernel logs, so by that point all the various system shutdown scripts had completed. I suspect there might be an issue in the virtualization layer, as it shouldn't take this long to recognise that the kernel has terminated. These 10 unusable minutes are billable, so it's a fairly expensive bug - at least $0.12 per instance at on-demand pricing. Thanks for taking a look. Cheers, James
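A polling loop like the following reproduces the measurement (the instance ID is a placeholder):

```
# Timestamp each state change until the instance finally reports terminated.
INSTANCE_ID=i-0123456789abcdef0   # placeholder
while true; do
  STATE=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
    --query 'Reservations[0].Instances[0].State.Name' --output text)
  echo "$(date -u +%FT%TZ) $STATE"
  [ "$STATE" = "terminated" ] && break
  sleep 10
done
```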
How to monitor GPU usage in an EC2 Inf1 instance?
Hello, I tried using an [Inf1 EC2 instance](https://aws.amazon.com/ec2/instance-types/inf1/) for deploying my ML model. I need to monitor the GPU usage of the ML model. I can find the CPU usage in the AWS console, but not the GPU usage.

Already tried:

1. https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-gpu-monitoring-gpumon.html

This didn't work. It threw this error:

```
(python3) ubuntu@ip-xxx-mm-yy-zzz:~/tools/GPUCloudWatchMonitor$ python3 gpumon.py
Traceback (most recent call last):
  File "gpumon.py", line 146, in <module>
    nvmlInit()
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.8/site-packages/pynvml/nvml.py", line 1450, in nvmlInit
    nvmlInitWithFlags(0)
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.8/site-packages/pynvml/nvml.py", line 1440, in nvmlInitWithFlags
    _nvmlCheckReturn(ret)
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.8/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_DriverNotLoaded: Driver Not Loaded
```

Also, `nvidia-smi` didn't work:

```
(python3) ubuntu@ip-xxx-mm-yy-zzz:~/tools/GPUCloudWatchMonitor$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
```

Kindly provide some help to monitor GPU usage.
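In case it matters, here is the kind of check I plan to try next - a sketch assuming the AWS Neuron SDK tools are installed on the instance (Inf1 uses Inferentia accelerators rather than NVIDIA GPUs, so I suspect nvidia-smi and pynvml simply have no driver to find):

```
# Neuron SDK command-line tools (assumed to be installed with the Neuron packages)
neuron-ls       # list the Inferentia devices visible to the instance
neuron-top      # live per-NeuronCore utilization, similar in spirit to nvidia-smi
neuron-monitor  # emits JSON metrics that can be forwarded to CloudWatch
```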
Why is the GPU not working out of the box on a Deep Learning AMI EC2 instance?
I'm having trouble using the GPU on a Deep Learning GPU EC2 instance. The specs of the instance are:

- Deep Learning AMI GPU PyTorch 1.11.0 (Amazon Linux 2) 20220328
- amazon/Deep Learning AMI GPU PyTorch 1.11.0 (Amazon Linux 2) 20220328

When I log into the instance and run `nvidia-smi`, I get the error: `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.`

Similarly, when I use the pre-installed PyTorch to check whether it can see a GPU, it returns False:

```
(pytorch) [ec2-user@ip-172-31-86-58 ~]$ python3
Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
```

The GPU setup should have worked out of the box, so how do I fix this?
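The next thing I plan to check, in case the instance type is the issue - a sketch using standard tooling, nothing AMI-specific assumed:

```
# Did I actually launch a GPU instance type? The DLAMI boots fine on CPU-only
# types, and then nvidia-smi has no device to talk to.
curl -s http://169.254.169.254/latest/meta-data/instance-type
lspci | grep -i nvidia   # should list an NVIDIA device on g4dn/g5/p3/p4 types
```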
Elastic Graphics Quota exceeded
Hello, I am receiving this error when using Terraform to provision a t3.medium with an eg1.medium Elastic Graphics accelerator:

```
Error launching source instance: ElasticGpuLimitExceeded: Your quota allows for 0 more Elastic Graphics accelerators in your account. You requested at least 1.
```

I can't find this limit in the Limits section or anywhere in Service Quotas. I have terminated all instances that use Elastic Graphics and still cannot create any new ones. Please help.
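For completeness, this is how I've been searching for the quota from the CLI - a sketch, and I'm assuming Elastic Graphics quotas appear under the ec2 service code, which may not be right:

```
# Search the EC2 service quotas for anything mentioning Elastic Graphics
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'Elastic')]" --output table

# Default (account-independent) values for comparison
aws service-quotas list-aws-default-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'Elastic')]" --output table
```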