Questions tagged with Amazon EC2


How can I work around spontaneous nvml mismatch errors in AWS ECS gpu image?

We're running g4dn.xlarge instances in a few ECS clusters for some ML services, using the AWS-provided GPU-optimized ECS AMI (ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly provisioned container instances stopped being able to register with our ECS clusters. After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that we were getting errors from nvml that prevented the ECS init routine from completing:

```
[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch
```

This is the same AMI as some older instances in the cluster that started up fine, and we noticed the issue simultaneously across 4 different clusters. Manually unloading the nvidia kernel modules on individual hosts and letting nvidia-smi reload them resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):

```
[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi
```

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch, and how can we work around it in an automated fashion?
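As a stopgap, we're considering baking the same module reload into the instance user data so it runs before the ECS agent registers. A rough, untested sketch of what we have in mind (it assumes the GPU-optimized ECS AMI, where the agent runs as the `ecs` systemd service, and the module names from the lsmod output above):

```
#!/bin/bash
# Stopgap: reload the nvidia kernel modules at boot so the loaded module
# version matches the installed userspace driver libraries before ECS init.
set -euo pipefail

# Stop the agent if it's already up (it may not be, on first boot).
systemctl stop ecs || true

# Unload in dependency order; skip any module that isn't loaded.
for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
    rmmod "$mod" 2>/dev/null || true
done

# nvidia-smi reloads the modules from the currently installed driver.
nvidia-smi

systemctl start ecs
```

We'd rather understand the root cause than keep this workaround in place permanently, though.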
asked 13 hours ago