How can I work around spontaneous NVML mismatch errors in the AWS ECS GPU-optimized AMI?

We're running g4dn.xlarges in a few ECS clusters for some ML services, and use the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly-provisioned container instances stopped being able to register with our ECS clusters.

After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that NVML errors were preventing the ECS init routine from completing:

[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch

This is the same AMI used by some older instances in the cluster, which started up fine. We noticed the issue simultaneously across 4 different clusters. Manually unloading the NVIDIA kernel modules on individual hosts and then running nvidia-smi resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):

[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch and how can we work around it in an automated fashion?
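
As a stopgap, the manual steps above could be scripted, for example from EC2 user data before ECS init runs. The following is only a minimal sketch under assumptions: the function names and the NVIDIA_SMI variable are hypothetical, it assumes the mismatch text appears in the nvidia-smi output as it does in the ecs-init log, and it relies on nvidia-smi loading the modules again as it appeared to in the manual fix. It does not address the root cause.

```shell
#!/bin/sh
# Sketch: unload the NVIDIA kernel modules when nvidia-smi reports a
# driver/library version mismatch, then let nvidia-smi load them again.

# Hypothetical override so the command can be swapped out for testing.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Same unload order as in the manual fix above.
NVIDIA_MODULES="nvidia_uvm nvidia_drm nvidia_modeset nvidia"

# Return 0 if the given text contains the mismatch message.
has_nvml_mismatch() {
  case "$1" in
    *"Driver/library version mismatch"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Unload each NVIDIA module that is currently loaded, then run nvidia-smi.
reload_nvidia_modules() {
  for mod in $NVIDIA_MODULES; do
    # The trailing space avoids matching nvidia_uvm when looking for nvidia.
    if lsmod | grep -q "^${mod} "; then
      sudo rmmod "$mod" || return 1
    fi
  done
  "$NVIDIA_SMI"
}

# Only touch the modules when the mismatch is actually reported.
fix_nvml_mismatch() {
  out="$("$NVIDIA_SMI" 2>&1)"
  if has_nvml_mismatch "$out"; then
    reload_nvidia_modules
  fi
}

# Run the check only when invoked with --run, so the file can be sourced.
if [ "${1:-}" = "--run" ]; then
  fix_nvml_mismatch
fi
```

Saved as, say, fix-nvml.sh (a hypothetical name), this would be invoked as `sh fix-nvml.sh --run`. rmmod fails if a module is still in use, in which case the function stops and returns non-zero rather than forcing the unload.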

asked a year ago, 84 views
No Answers