EKS Cluster Autoscaler cannot scale nvidia.com/gpu nodes from zero


We use EKS with Cluster Autoscaler (CA) to scale our Auto Scaling groups (ASGs). We also use nvidia-device-plugin, https://github.com/NVIDIA/k8s-device-plugin, to manage the nvidia.com/gpu resources in the cluster. The ASG min size is configured as zero, and when a workload is scheduled we would like CA to scale the ASG up automatically. Our ASGs are tagged with "k8s.io/cluster-autoscaler/node-template/label" as instructed in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0. However, when a pod tries to get scheduled, it receives a NotTriggerScaleUp event with the message "1 Insufficient nvidia.com/gpu", which prevents the ASG from being scaled up. From what I can tell, this behavior is inconsistent: the scale-up succeeds some of the time even though all other factors are kept the same.
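To make the tagging concrete, here is a minimal sketch of how the node-template label tags mentioned above are formed. It only builds plain dictionaries and does not call AWS. The helper name `node_template_label_tags`, the example label `accelerator=nvidia-gpu`, and the `PropagateAtLaunch` field are assumptions for illustration, not our actual configuration.

```python
# Tag key prefix for node labels, as described in the Cluster Autoscaler FAQ.
NODE_TEMPLATE_LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def node_template_label_tags(labels):
    """Build ASG tag entries advertising node labels to Cluster Autoscaler.

    `labels` maps a Kubernetes label name to its value. The returned list
    uses a Key/Value shape; this helper is hypothetical and for illustration.
    """
    return [
        {
            "Key": NODE_TEMPLATE_LABEL_PREFIX + name,
            "Value": value,
            # Assumption: the tag only needs to exist on the ASG itself.
            "PropagateAtLaunch": False,
        }
        for name, value in sorted(labels.items())
    ]


# Hypothetical label; substitute whatever the GPU pods select on.
tags = node_template_label_tags({"accelerator": "nvidia-gpu"})
for tag in tags:
    print(tag["Key"], "=", tag["Value"])
```

With tags of this form on an ASG that has zero instances, CA can build a template node carrying those labels when it simulates a scale-up.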

ssheng
posted 9 months ago, 91 views
No answers
