EKS Cluster Autoscaler cannot scale nvidia.com/gpu nodes from zero


We run EKS with Cluster Autoscaler (CA) to scale our Auto Scaling groups (ASGs), and we use nvidia-device-plugin, https://github.com/NVIDIA/k8s-device-plugin, to manage the nvidia.com/gpu resources in the cluster. The ASG min size is configured as zero, and when a workload is scheduled we would like CA to scale the ASG up automatically. Our ASGs are tagged with "k8s.io/cluster-autoscaler/node-template/label" as instructed in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0. However, when a pod tries to get scheduled, it receives a NotTriggerScaleUp event with the message "1 Insufficient nvidia.com/gpu", and the ASG is not scaled up. From what I can tell, this behavior is inconsistent: the scale-up succeeds some of the time and fails the rest of the time, with all other factors kept the same.
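To make the tagging scheme concrete, here is a minimal Python sketch that builds ASG tag keys under the "k8s.io/cluster-autoscaler/node-template" prefix. The helper name `node_template_tags`, the label `workload-type=gpu`, and the GPU count of 1 are hypothetical values chosen for illustration. The `resources/nvidia.com/gpu` key follows the node-template pattern described in the Cluster Autoscaler AWS provider documentation; whether that tag is present on the ASGs in this question is not stated above, so treat it as an assumption.

```python
# Prefix used by Cluster Autoscaler to read a node template from ASG tags
# when the group currently has zero nodes.
NODE_TEMPLATE_PREFIX = "k8s.io/cluster-autoscaler/node-template"


def node_template_tags(labels, resources):
    """Build ASG tag key/value pairs describing a future node.

    labels:    node labels the new node would carry (hypothetical values here)
    resources: extended resources the new node would advertise
    """
    tags = {}
    for name, value in labels.items():
        tags[f"{NODE_TEMPLATE_PREFIX}/label/{name}"] = str(value)
    for name, quantity in resources.items():
        tags[f"{NODE_TEMPLATE_PREFIX}/resources/{name}"] = str(quantity)
    return tags


# Hypothetical example: one label and one GPU per node.
tags = node_template_tags(
    labels={"workload-type": "gpu"},
    resources={"nvidia.com/gpu": 1},
)

for key, value in sorted(tags.items()):
    print(f"{key}={value}")
```

The resulting dictionary could then be applied to the ASG with whatever tooling manages it (console, CLI, or infrastructure-as-code); that step is left out here because it depends on the setup.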

ssheng
asked 9 months ago, 91 views
No answers
