How can I work around spontaneous NVML mismatch errors in the AWS ECS GPU image?

We're running g4dn.xlarge instances in a few ECS clusters for some ML services, using the AWS-provided GPU-optimized ECS AMI (https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:visibility=public-images;imageId=ami-07dd70259efc9d59b). This morning at around 7-8am PST (12/7/2022), newly provisioned container instances stopped being able to register with our ECS clusters.

After some poking around on the boxes and reading /var/log/ecs/ecs-init.log, we found that NVML errors were preventing the ECS init routine from completing:

[ERROR] Nvidia GPU Manager: setup failed: error initializing nvidia nvml: nvml: Driver/library version mismatch
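
The message says the loaded kernel module and the user-space NVML library disagree on the driver version. A minimal sketch of how the two could be compared on an affected host follows. It assumes the library lives under /usr/lib64 and carries its version in the file name; the helper names and the example version number are hypothetical and were not part of our debugging session.

```shell
#!/bin/sh
# Sketch: compare the loaded NVIDIA kernel module version with the
# user-space NVML library version. The /usr/lib64 path and all function
# names here are assumptions made for illustration.

# Extract the version suffix from a libnvidia-ml file name,
# e.g. "libnvidia-ml.so.470.57.02" -> "470.57.02" (made-up version).
lib_version_from_name() {
  basename "$1" | sed -n 's/^libnvidia-ml\.so\.\([0-9][0-9.]*\)$/\1/p'
}

# Print whether two version strings agree; return 1 when they differ.
compare_versions() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "MISMATCH: kernel module $1, user-space library $2"
    return 1
  fi
}

check_nvidia_versions() {
  # Version of the kernel module that is currently loaded.
  kernel_ver="$(cat /sys/module/nvidia/version)" || return 2
  lib_ver=""
  for f in /usr/lib64/libnvidia-ml.so.*; do
    v="$(lib_version_from_name "$f")"
    # Skip the short ".so.1" symlink; keep a dotted version such as x.y.z.
    case "$v" in *.*) lib_ver="$v" ;; esac
  done
  compare_versions "$kernel_ver" "$lib_ver"
}

# Only touch the system when explicitly asked to.
if [ "${1:-}" = "--check" ]; then
  check_nvidia_versions
fi
```

Run with `--check` on a host showing the error; a MISMATCH line would suggest the user-space packages changed after the kernel module was loaded, though that is a guess about the cause rather than something we confirmed.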

This is the same AMI as some older instances in the cluster that started up fine. We noticed the issue simultaneously across 4 different clusters. Manually unloading and reloading the NVIDIA kernel modules on individual hosts resolved the mismatch and allowed ECS init to complete (and the instances to become available for task allocation):

[ec2-user@- ~]$ lsmod | grep nvidia
nvidia_drm             61440  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_uvm           1142784  0
nvidia              35459072  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
drm                   421888  4 drm_kms_helper,nvidia,nvidia_drm
i2c_core               77824  3 drm_kms_helper,nvidia,drm
[ec2-user@- ~]$ sudo rmmod nvidia_uvm
[ec2-user@- ~]$ sudo rmmod nvidia_drm
[ec2-user@- ~]$ sudo rmmod nvidia_modeset
[ec2-user@- ~]$ sudo rmmod nvidia
[ec2-user@- ~]$ nvidia-smi

This seems a bit bonkers, as it's a regression in the absence of a new AMI or any changes to our application or AWS resources. What causes this spontaneous mismatch and how can we work around it in an automated fashion?
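
For reference, the manual steps above could be scripted roughly as follows. This is only a sketch of the stopgap, not a fix for the root cause. It assumes the script runs as root at boot (for example from instance user data), that nvidia-smi prints the same "Driver/library version mismatch" text as the ECS log, and that ECS init has to be restarted afterwards with `systemctl restart ecs`. The function names and the `--apply` flag are hypothetical.

```shell
#!/bin/sh
# Sketch: automate the manual rmmod / nvidia-smi workaround.
# Assumptions: run as root; the "ecs" systemd unit exists on the AMI.

# Modules in the order the manual session unloaded them (dependents first).
NVIDIA_MODULES="nvidia_uvm nvidia_drm nvidia_modeset nvidia"

# Return 0 if the given text contains the NVML mismatch message.
has_mismatch() {
  printf '%s\n' "$1" | grep -q 'Driver/library version mismatch'
}

# Print, in unload order, the NVIDIA modules present in lsmod-style text.
loaded_nvidia_modules() {
  for mod in $NVIDIA_MODULES; do
    if printf '%s\n' "$1" | awk '{print $1}' | grep -qx "$mod"; then
      echo "$mod"
    fi
  done
}

# Unload the modules, then run nvidia-smi as in the manual session.
# rmmod fails if a module is still in use; in that case we give up.
reload_nvidia() {
  for m in $(loaded_nvidia_modules "$(lsmod)"); do
    rmmod "$m" || return 1
  done
  nvidia-smi
}

main() {
  out="$(nvidia-smi 2>&1)" && return 0              # driver is healthy
  has_mismatch "$out" || { echo "$out" >&2; return 1; }  # other failure
  reload_nvidia || return 1
  systemctl restart ecs   # assumption: lets ECS init retry GPU setup
}

# Only touch the system when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
  main
fi
```

Without `--apply` the script only defines the functions, so it can be sourced and checked against saved lsmod output before being wired into provisioning.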

asked a year ago · 84 views