2 Answers
You might want to try https://karpenter.sh/ for autoscaling and see if it improves your utilization. It can also consolidate pods onto fewer nodes, though that may not always be appropriate for running jobs.
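For reference, a minimal sketch of a Karpenter NodePool with consolidation enabled. This assumes the karpenter.sh/v1beta1 API; the NodePool name and the capacity-type requirement are placeholders, not taken from the question:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch-jobs                           # hypothetical name
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized   # lets Karpenter repack pods onto fewer nodes
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]              # assumption; spot is also possible
```

Note that WhenUnderutilized consolidation can evict running pods, which matters for the job workloads mentioned above.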
answered 3 years ago
One possible solution that I'm exploring:
- start a second scheduler (https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)
- configure that scheduler to use resource bin packing (the NodeResourcesFit plugin with the MostAllocated or RequestedToCapacityRatio scoring strategy)
- configure the pods to use this second scheduler instead of the default-scheduler via .spec.schedulerName
- this is a challenge in itself: the software that is actually creating the pods needs to support configuring .spec.schedulerName. For example, Airflow creates short-lived pods to run tasks, and it's easy to configure the pods' tolerations, annotations, labels, etc., but not .spec.schedulerName.
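A sketch of what the scheduler-side and pod-side configuration for the steps above could look like, assuming the kubescheduler.config.k8s.io/v1 API; the scheduler name bin-packing-scheduler, the pod name, and the container image are made-up examples:

```yaml
# KubeSchedulerConfiguration for the second scheduler instance
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler   # hypothetical name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated            # score nodes higher the fuller they already are
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
---
# Pods then opt in to the second scheduler via .spec.schedulerName
apiVersion: v1
kind: Pod
metadata:
  name: example-task                       # hypothetical
spec:
  schedulerName: bin-packing-scheduler
  containers:
    - name: task
      image: busybox                       # placeholder image
```

With MostAllocated scoring, new pods land on the most-utilized feasible nodes first, so lightly used nodes can drain empty instead of each receiving a share of the load.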
answered a year ago

We are already using Karpenter and it does not solve the problem; it makes it worse. During scale-up Karpenter creates much bigger nodes than cluster autoscaler, and those nodes are much more underutilized after the load goes away.
To fix the problem we need to be able to adapt the scheduling policy of the Kubernetes scheduler to use its bin-packing capability. New pods would then not be spread across all the nearly empty nodes but be bin packed onto just some of them, leaving other nodes empty so that Karpenter can remove them.
I have to agree with Leon, at least if the Karpenter NodePool is using consolidationPolicy: WhenEmpty. If you use WhenUnderutilized then yes, the nodes will be consolidated, but pods will be killed in the process, and many workloads do not tolerate that well (for example Airflow worker pods). It would be much better if the Kubernetes scheduler could favor underutilized nodes via resource bin packing instead of trying to spread the load across all available nodes. That way more nodes could reach the WhenEmpty condition, at least for my workload, where pods are short-lived (about 30 minutes) but I don't want them to be killed.
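For comparison, the WhenEmpty variant discussed here might look like this, again assuming karpenter.sh/v1beta1 field names; the NodePool name and the consolidateAfter value are assumptions for illustration:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: short-lived-workers          # hypothetical
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # only remove a node once all its pods have finished
    consolidateAfter: 30s            # assumption: how long a node must stay empty first
```

This never evicts running pods, which is why it pairs well with bin-packing scheduling: the packing has to come from the scheduler, since consolidation alone won't move pods off underutilized nodes.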