How can I work around spontaneous NVML mismatch errors in the AWS ECS GPU AMI?

We're running g4dn.xlarge instances in a few ECS clusters for some ML services, using the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning, around 7-8am PST (12/7/2022), newly provisioned container instances stopped being able to register with our ECS clusters.

After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, it turned out we were getting errors from NVML that prevented the ECS init routine from completing:

[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch

This is the same AMI used by older instances in the cluster that started up fine, and we noticed the issue simultaneously across four different clusters. Manually unloading and reloading the NVIDIA kernel modules on individual hosts resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):

[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi
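
For reference, here's a rough sketch of how the same recovery could be scripted so it can run unattended instead of someone SSHing in. It assumes the mismatch comes from the on-disk driver being newer than the loaded kernel module, and that nothing is holding the GPU open at boot (otherwise the rmmod fails); the script name and the /proc/driver/nvidia/version parsing are placeholders of our own, not anything shipped with the AMI:

#!/usr/bin/env bash
# fix-nvml-mismatch.sh (placeholder name): reload the NVIDIA modules if the
# loaded kernel module version differs from the driver installed on disk.
set -euo pipefail

# Version of the kernel module currently loaded (the format of this file can
# vary between driver branches, so treat this parse as best-effort).
loaded_ver="$(grep -oP 'Kernel Module\s+\K[0-9.]+' /proc/driver/nvidia/version 2>/dev/null || true)"
# Version of the module sitting on disk for the running kernel.
ondisk_ver="$(modinfo -F version nvidia 2>/dev/null || true)"

if [[ -n "$loaded_ver" && -n "$ondisk_ver" && "$loaded_ver" != "$ondisk_ver" ]]; then
    echo "NVIDIA driver mismatch (loaded=$loaded_ver, on-disk=$ondisk_ver); reloading modules"
    # Same order as the manual fix above: dependents first, nvidia last.
    rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
    # nvidia-smi brings the driver back up against the on-disk modules.
    nvidia-smi
fi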

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch and how can we work around it in an automated fashion?
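
For context, the kind of automation we could imagine is running a check like the sketch above before the ECS agent starts, e.g. from launch template user data via a systemd drop-in. This assumes the agent is managed by ecs.service on this AMI and that /usr/local/bin/fix-nvml-mismatch.sh is the placeholder script above; we'd still rather understand the root cause than paper over it:

#!/bin/bash
# Hypothetical user data snippet (in addition to the usual ECS_CLUSTER setup):
# run the mismatch check before ecs.service starts. The script path is a
# placeholder for the sketch above; the leading '-' tells systemd to keep
# starting the agent even if the check fails.
mkdir -p /etc/systemd/system/ecs.service.d
cat > /etc/systemd/system/ecs.service.d/10-fix-nvml.conf <<'EOF'
[Service]
ExecStartPre=-/usr/local/bin/fix-nvml-mismatch.sh
EOF
systemctl daemon-reload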
