We're running g4dn.xlarge instances in a few ECS clusters for some ML services, and we use the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly provisioned container instances stopped being able to register with our ECS clusters.
After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that NVML errors were preventing the ECS init routine from completing:
[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch
The failing instances use the same AMI as some older instances in the cluster that started up fine. We noticed the issue simultaneously across 4 different clusters. Manually unloading the NVIDIA kernel modules on individual hosts and then running nvidia-smi resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):
[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm 61440 0
nvidia_modeset 1200128 1 nvidia_drm
nvidia_uvm 1142784 0
nvidia 35459072 2 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 1 nvidia_drm
drm 421888 4 drm_kms_helper,nvidia,nvidia_drm
i2c_core 77824 3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi
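For reference, the manual steps above can be collected into a small script. This is only a rough sketch of what we did by hand, not a vetted fix: it assumes the module names and unload order from the lsmod output above and that nothing is holding the GPU open. The `reload_nvidia` function and the `DRY_RUN` switch are names made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: unload the NVIDIA kernel modules in dependency order, then run
# nvidia-smi, mirroring the manual session above.

# Dependents first, core module last (order taken from the manual session).
NVIDIA_MODULES="nvidia_uvm nvidia_drm nvidia_modeset nvidia"

# DRY_RUN=1 (hypothetical switch) prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

reload_nvidia() {
  for mod in $NVIDIA_MODULES; do
    # Outside dry-run mode, skip modules that are not currently loaded.
    if [ "${DRY_RUN:-0}" = "1" ] || grep -q "^${mod} " /proc/modules; then
      run sudo rmmod "$mod" || return 1
    fi
  done
  # Same final step as the manual session.
  run nvidia-smi
}
```

Calling `DRY_RUN=1 reload_nvidia` prints the commands without touching the host; calling `reload_nvidia` on an affected instance runs them. This only papers over the symptom and says nothing about why the mismatch appears in the first place.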
This seems a bit bonkers, as it's a regression that appeared without a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch, and how can we work around it in an automated fashion?