EKS Cluster Autoscaler cannot scale nvidia.com/gpu nodes from zero

We use EKS with Cluster Autoscaler (CA) to scale autoscaling node groups (ASGs). We also use nvidia-device-plugin, https://github.com/NVIDIA/k8s-device-plugin, to manage the nvidia.com/gpu resources in the cluster. The ASG minimum size is set to zero, and when a workload is scheduled we would like CA to scale the ASG up automatically. Our ASGs are tagged with "k8s.io/cluster-autoscaler/node-template/label" as instructed in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0. However, when a pod tries to get scheduled, it receives a NotTriggerScaleUp event with the message "1 Insufficient nvidia.com/gpu", and the ASG is not scaled up. As far as I can tell, this behavior is inconsistent: the scale-up succeeds some of the time and fails the rest, with all other factors kept the same.
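
For reference, here is a minimal sketch of the node-template tag keys an ASG can carry so that CA can model a node that does not exist yet. The "label" prefix is the one quoted above; the "resources" prefix is the sibling key described in the Cluster Autoscaler AWS provider README. The helper name node_template_tags, the label workload-type=gpu, and the GPU count of 1 are hypothetical placeholders, not values from our cluster, and whether a missing resources tag explains the intermittent failures is not established here.

```python
# Tag prefix used by Cluster Autoscaler on AWS to describe a node template.
NODE_TEMPLATE_PREFIX = "k8s.io/cluster-autoscaler/node-template"


def node_template_tags(labels, resources):
    """Build ASG tag key/value pairs describing labels and resources of a future node.

    Hypothetical helper for illustration; it only formats strings and
    does not call any AWS API.
    """
    tags = {}
    for name, value in labels.items():
        tags[f"{NODE_TEMPLATE_PREFIX}/label/{name}"] = value
    for name, quantity in resources.items():
        # ASG tag values are strings, so quantities are stringified.
        tags[f"{NODE_TEMPLATE_PREFIX}/resources/{name}"] = str(quantity)
    return tags


# Placeholder values: adjust the label and GPU count to the instance type in use.
tags = node_template_tags(
    labels={"workload-type": "gpu"},
    resources={"nvidia.com/gpu": 1},
)

for key, value in sorted(tags.items()):
    print(f"{key}={value}")
```

The printed key/value pairs are what would be set as tags on the ASG itself; CA reads them when the group has no running node to inspect.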

ssheng
asked 9 months ago, 91 views
No answers
