How can I work around spontaneous NVML driver/library mismatch errors on the AWS ECS GPU-optimized AMI?


We're running g4dn.xlarge instances in a few ECS clusters for some ML services, using the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning, at around 7-8am PST (12/7/2022), newly provisioned container instances stopped being able to register with our ECS clusters.

After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out that NVML errors were preventing the ECS init routine from completing:

[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch
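
The message suggests the driver installed on disk and the kernel module that is currently loaded have drifted apart. A quick way to confirm that on an affected host is to compare the loaded module's version with the version of the module installed on disk (standard driver locations, nothing AMI-specific assumed):

cat /proc/driver/nvidia/version     # version of the NVIDIA kernel module currently loaded
modinfo nvidia | grep ^version      # version of the nvidia module installed on disk

If the two disagree, something updated the driver packages after the modules were loaded at boot.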

This is the same AMI that some older instances in these clusters booted from without issue, and we noticed the problem simultaneously across 4 different clusters. Manually unloading the NVIDIA kernel modules and letting them reload on individual hosts resolved the mismatch and allowed ECS init to complete (and the instances to become available for task placement):

[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi    # loads the kernel modules that match the installed userspace driver
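
As a stopgap, the same reload could be scripted from the instance user data (or a small systemd unit ordered before ecs.service) so it runs before the agent tries to register. A minimal sketch, assuming it runs as root at boot before any tasks are placed, so nothing else is holding the modules open:

#!/usr/bin/env bash
# Reload the NVIDIA kernel modules if nvidia-smi/NVML is unhappy.
set -euo pipefail

if ! nvidia-smi > /dev/null 2>&1; then
    # Unload in dependency order; ignore modules that aren't loaded.
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        rmmod "$mod" 2> /dev/null || true
    done
    # Running nvidia-smi loads the modules that match the installed driver.
    nvidia-smi
fi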

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch and how can we work around it in an automated fashion?

Asked a year ago, viewed 84 times
No answers
