EKS Cluster Autoscaler cannot scale nvidia.com/gpu nodes from zero


We use EKS with Cluster Autoscaler (CA) to scale our autoscaling node groups (ASGs). We also use nvidia-device-plugin, https://github.com/NVIDIA/k8s-device-plugin, to manage the nvidia.com/gpu resources in the cluster. The ASG minimum size is set to zero, and when a workload is scheduled we would like CA to scale up the ASG automatically. Our ASGs are tagged with "k8s.io/cluster-autoscaler/node-template/label" as instructed in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0. However, when a pod tries to get scheduled, it receives a NotTriggerScaleUp event with the message "1 Insufficient nvidia.com/gpu", and the ASG is not scaled up. From what I can tell, this behavior is inconsistent: the scale-up succeeds some of the time, with all other factors kept the same.
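To make the tagging setup concrete, here is a minimal sketch of the tag keys involved. It assumes the "k8s.io/cluster-autoscaler/node-template/resources/<resource-name>" tag convention described in the Cluster Autoscaler AWS provider documentation, which sits alongside the label tags mentioned above. The helper names, the "workload-type" label, and the GPU count of 1 are hypothetical, and a missing resources tag is only one possible explanation for the "Insufficient nvidia.com/gpu" message, not a confirmed diagnosis.

```python
# Tag key prefixes from the Cluster Autoscaler AWS provider docs (assumed
# to apply to this setup; the question only mentions the label prefix).
LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"
RESOURCE_PREFIX = "k8s.io/cluster-autoscaler/node-template/resources/"


def node_template_tags(labels, resources):
    """Build ASG tag key/value pairs describing a node the group would create."""
    tags = {LABEL_PREFIX + key: str(value) for key, value in labels.items()}
    tags.update({RESOURCE_PREFIX + name: str(qty) for name, qty in resources.items()})
    return tags


def missing_resource_tags(asg_tags, requested_resources):
    """Return requested extended resources that the ASG tags do not advertise."""
    return sorted(
        name for name in requested_resources
        if RESOURCE_PREFIX + name not in asg_tags
    )


# Hypothetical example values: one GPU per node and a made-up node label.
tags = node_template_tags(
    labels={"workload-type": "gpu"},
    resources={"nvidia.com/gpu": 1},
)

# Simulate an ASG that carries only the label tags.
label_only = {k: v for k, v in tags.items() if k.startswith(LABEL_PREFIX)}

print(missing_resource_tags(label_only, ["nvidia.com/gpu"]))
print(missing_resource_tags(tags, ["nvidia.com/gpu"]))
```

With only label tags present, the check reports nvidia.com/gpu as unadvertised; with the resources tag added, it reports nothing missing. Comparing the real ASG tags against the pod's resource requests in this way is a quick sanity check before digging into the CA logs.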

ssheng
asked 9 months ago, 91 views
No answers
