EKS Cluster Autoscaler cannot scale nvidia.com/gpu nodes from zero

We use EKS with Cluster Autoscaler (CA) to scale our Auto Scaling groups (ASGs). We also use nvidia-device-plugin, https://github.com/NVIDIA/k8s-device-plugin, to manage the nvidia.com/gpu resources in the cluster. The ASG min size is configured as zero, and when a workload is scheduled we'd like CA to scale the ASG up automatically. Our ASGs are tagged with "k8s.io/cluster-autoscaler/node-template/label" as instructed in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0. However, when a pod tries to get scheduled, it receives a NotTriggerScaleUp event with the message "1 Insufficient nvidia.com/gpu", and the ASG is not scaled up. From what I can tell, this behavior is inconsistent: with all other factors kept the same, the scale-up succeeds some of the time and fails the rest.
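For reference, the node-template tagging described above can be sketched in Python. This is only an illustration of the tag key format documented for Cluster Autoscaler on AWS (`k8s.io/cluster-autoscaler/node-template/label/<name>` and `k8s.io/cluster-autoscaler/node-template/resources/<name>`). The helper name `node_template_tags` and the `workload-type` label are hypothetical, and the post does not confirm whether adding a resources tag changes the behavior described.

```python
# Prefix used by Cluster Autoscaler to read node-template hints from ASG tags.
NODE_TEMPLATE_PREFIX = "k8s.io/cluster-autoscaler/node-template"


def node_template_tags(labels=None, resources=None):
    """Build ASG tag key/value pairs describing a node that does not exist yet.

    labels:    node labels the future node would carry (hypothetical values)
    resources: extended resources the future node would advertise
    """
    tags = {}
    for name, value in (labels or {}).items():
        tags[f"{NODE_TEMPLATE_PREFIX}/label/{name}"] = str(value)
    for name, value in (resources or {}).items():
        tags[f"{NODE_TEMPLATE_PREFIX}/resources/{name}"] = str(value)
    return tags


if __name__ == "__main__":
    # Assumed example values: one GPU per node and an illustrative label.
    example = node_template_tags(
        labels={"workload-type": "gpu"},
        resources={"nvidia.com/gpu": 1},
    )
    for key, value in sorted(example.items()):
        print(f"{key}={value}")
```

The resulting pairs would be applied as tags on the ASG itself; comparing them against the tags actually present on the group is one way to check the configuration.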

ssheng
asked 9 months ago, viewed 91 times
No answers
