Questions tagged with Amazon EC2

Content language: English

Sort by most recent

Browse through the questions and answers listed below or filter and sort to narrow down your results.

How can I work around spontaneous nvml mismatch errors in AWS ECS gpu image?

We're running g4dn.xlarges in a few ECS clusters for some ML services, and use the AWS-provided GPU-optimized ECS AMI (;imageId=ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly-provisioned container instances stopped being able to register with our ECS clusters. After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that we were getting errors in nvml that prevented the ECS init routine from completing: ``` [ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch ``` This is the same AMI as some older instances in the cluster that started up fine. We noticed the issue simultaneously across 4 different clusters. Manually killing and restart nvidia components on individual hosts resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation): ``` [ec2-user@- ~]$ lsmod | grep nvidia nvidia_drm 61440 0 nvidia_modeset 1200128 1 nvidia_drm nvidia_uvm 1142784 0 nvidia 35459072 2 nvidia_uvm,nvidia_modeset drm_kms_helper 184320 1 nvidia_drm drm 421888 4 drm_kms_helper,nvidia,nvidia_drm i2c_core 77824 3 drm_kms_helper,nvidia,drm [ec2-user@- ~]$ sudo rmmod nvidia_uvm [ec2-user@- ~]$ sudo rmmod nvidia_drm [ec2-user@- ~]$ sudo rmmod nvidia_modeset [ec2-user@- ~]$ sudo rmmod nvidia [ec2-user@- ~]$ nvidia-smi ``` This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch and how can we work around it in an automated fashion?
asked 6 hours ago