How do I troubleshoot cluster scaling with Karpenter autoscaling in Amazon EKS?
I want to troubleshoot cluster scaling with the Karpenter autoscaler in Amazon Elastic Kubernetes Service (Amazon EKS).
Resolution
Troubleshoot your issue based on the error message that you receive.
Can't schedule Karpenter Pods because of insufficient Amazon EKS node group instances
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
In Karpenter version 0.16.0 and later, the default replica count changed from 1 to 2. For more information, see v0.16.0 on the GitHub website. If there isn't enough node capacity in the cluster to support the configured number of replicas, then you can't schedule Karpenter Pods. Because Karpenter can't provision nodes to run its own Pods, it fails because of insufficient capacity, and the Pods remain unscheduled. You then receive the following error message:
"Warning FailedScheduling 3m default-scheduler 0/1 nodes are available: 1 Insufficient memory."
To resolve this error, take one of the following actions:
Reduce the Karpenter deployment replicas to one
If your Karpenter deployment doesn't require redundancy, then change it to use a single replica. Run the following command:
kubectl scale deployment karpenter --replicas=1
Increase the node capacity for the Karpenter Pods
To run two replicas of Karpenter, make sure that there's sufficient capacity for two replicas. Choose one of the following options:
Scale out the Auto Scaling group
- Increase the minimum instance count in the Auto Scaling group. Run the following command:
aws autoscaling update-auto-scaling-group --auto-scaling-group-name your-node-group-name --min-size 2 --desired-capacity 2
Note: Replace your-node-group-name with the name of your Auto Scaling group.
- Make sure that there are nodes that Karpenter doesn't manage. Check the node labels for Karpenter labels, such as karpenter.sh/nodepool. Run the following command:
kubectl get nodes --show-labels | grep karpenter.sh/nodepool
Use existing nodes
If the target existing node or nodes have Karpenter labels such as karpenter.sh/nodepool, then remove the labels. Run the following command:
kubectl label nodes your-node-name karpenter.sh/nodepool-
Note: Replace your-node-name with the name of your node.
Volume attachments and mount failures
When multiple Pods with Persistent Volume Claims (PVCs) are scheduled on the same node, the node might exceed its volume attachment limit. Then, you might receive either of the following error messages:
"Warning FailedAttachVolume pod/example-pod AttachVolume. Attach failed for volume " " : rpc error: code = Internal desc = Could not attach volume " " to node " ": attachment of disk " " failed, expected device to be attached but was attaching"
"Warning FailedMount pod/example-pod Unable to attach or mount volumes: unmounted volumes=[], unattached volumes=[]: timed out waiting for the condition"
To resolve volume attachment and mount failures for PVC-heavy workloads, complete the following steps:
- Apply topologySpreadConstraints and podAntiAffinity to prevent PVC-heavy Pods from being scheduled on the same node. For more information, see topologySpreadConstraints field and Pod affinity example on the Kubernetes website. This action distributes PVC-heavy pods across different nodes to avoid the concentration of volume attachments on a single node.
- Use CSI drivers like Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver (aws-ebs-csi-driver), and add startup taints to your NodePool. These actions make sure that Pods aren't prematurely scheduled on nodes before they're fully ready.
Example configuration for startup taints in Amazon EBS:
apiVersion: karpenter.sh/v1
kind: NodePool
spec:
  template:
    spec:
      startupTaints:
        - key: ebs.csi.aws.com/agent-not-ready
          effect: NoExecute
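The topologySpreadConstraints and podAntiAffinity approach in the first step can be sketched as the following Deployment snippet. The app name, label, and image are assumptions for illustration only:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pvc-heavy-app          # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pvc-heavy-app
  template:
    metadata:
      labels:
        app: pvc-heavy-app
    spec:
      # Spread replicas across nodes so volume attachments
      # aren't concentrated on a single node
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: pvc-heavy-app
      # Keep replicas of this PVC-heavy app off the same node
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: pvc-heavy-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:latest  # placeholder image
```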
Deprecated storage plugin error
Karpenter doesn't support deprecated in-tree storage plugins such as Amazon EBS. If you use a statically provisioned Persistent Volume (PV) with an in-tree plugin, then Karpenter can't discover the node's volume attachment limits. This scenario can cause schedule failures, and you might receive the following error message:
"ERROR controller.node_state PersistentVolume source 'AWSElasticBlockStore' uses an in-tree storage plugin which is unsupported by Karpenter and is deprecated by Kubernetes."
To resolve this issue, use CSI drivers for Amazon EBS and update your PV configurations to use the CSI driver.
Schedule or bin-pack failures because of unspecified resource requests
Karpenter bin-packs Pods based on resource requests. If the requests are too low or missing, then Karpenter might allocate too many Pods to the same node. This scenario can lead to resource contention and CPU throttling. Additionally, if memory limits are set and Pods try to use more memory than their limit, then there might be Out-Of-Memory (OOM) terminations. You might receive the following error message:
"Warning OOMKilled pod/your-pod-name Container "container-name" was killed due to OOM (Out of Memory). Memory limit: 512Mi, Memory usage: 513Mi"
To prevent these issues, use LimitRange configurations to set minimum resource requests for accurate bin packing. LimitRange configurations help establish maximum limits to prevent excessive consumption. They also provide default limits for unspecified Pods. For more information, see Use LimitRanges to configure defaults for resource requests and limits.
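The LimitRange approach can be sketched as follows. The namespace, name, and resource values are illustrative assumptions; size them for your workloads:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resource-limits   # hypothetical name
  namespace: your-namespace       # replace with your namespace
spec:
  limits:
    - type: Container
      # Defaults applied to containers that don't specify requests or limits,
      # so Karpenter always has a request to bin-pack against
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
      # Floor and ceiling to prevent too-low requests and excessive consumption
      min:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "2"
        memory: 2Gi
```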
Windows Pods fail to launch with image pull error
A Pod fails to launch if its container operating system (OS) version doesn't match the Windows OS version. You receive an error message similar to the following:
"Failed to pull image "mcr.microsoft.com/windows/servercore:xxx": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/windows/servercore:xxx": no match for platform in manifest: not found"
To resolve this issue, define your Pod's nodeSelector to make sure that your containers are scheduled on a compatible OS host version. For more information, see Windows container version compatibility on the Microsoft website.
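A nodeSelector that pins a Windows Pod to a matching host OS build might look like the following sketch. The Pod name and build number are assumptions; use the build that matches your container image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: windows-app   # hypothetical Pod name
spec:
  nodeSelector:
    kubernetes.io/os: windows
    # Match the node's Windows build to the container image's build
    node.kubernetes.io/windows-build: "10.0.20348"   # example: Windows Server 2022
  containers:
    - name: servercore
      image: mcr.microsoft.com/windows/servercore:ltsc2022
```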
Nodes not initialized properly
The system determines node initialization based on node readiness, expected resources registration, and the removal of NodePool startup taints. If any of these conditions aren't met, then the Karpenter node fails to initialize properly, and the node remains in a NotReady state. As a result, the system can't use the node to schedule or consolidate workloads. You might receive the following error message:
"Nodes provisioned by Karpenter are in a NotReady state"
Verify that the node state is Ready. If it isn't, then inspect the Kubelet logs to identify potential issues with permissions, security groups, or networking.
Verify that all required resources, such as nvidia.com/gpu or vpc.amazonaws.com/pod-eni, are correctly registered on the node.
To check nvidia.com/gpu resources on the node, run the following command:
kubectl describe node your-node-name
Note: Replace your-node-name with your node's name.
Example output:
...
Capacity:
  nvidia.com/gpu.shared: 80
...
If these resources are missing, then verify that the appropriate daemonset or plugins are running. To check the daemonset, run the following command:
kubectl get ds -n your-daemonset-namespace
Note: Replace your-daemonset-namespace with your daemonset namespace.
Scheduling failures because of various constraints and limitations
Pod can't be scheduled because of affinity, anti-affinity, or topology spread constraints
A Pod isn't scheduled if affinity, anti-affinity, or topology spread constraints require specific nodes or zones, but suitable nodes don't exist in the required locations. If the system can't place a Pod because of node or zone requirements that aren't met, then you might receive the following error message:
"Warning FailedScheduling pod/"pod-name" 0/3 nodes are available: 1 node(s) didn't match pod affinity rules, 2 node(s) didn't match pod topology spread constraints rules, 3 nodes(s) didn't match inter-pod anti-affinity rules."
To resolve this error, review and adjust the Pod's affinity and anti-affinity settings or topology spread constraints to align with available nodes. You can relax these constraints or provision more nodes in the required zones.
Failed to schedule Pod because of insufficient resources
Pods remain unscheduled because of resource requests that exceed the available node capacity. If there aren't any nodes that have sufficient CPU, memory, or other resources to accept the Pod, then you might receive the following error message:
"Warning FailedScheduling 30s (x13 over 60m) default-scheduler 0/5 nodes are available: 1 Insufficient memory. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod."
To resolve this issue, make sure that the resource requests in the Pod specification reflect actual usage. Adjust the resource requests and limits if necessary, or provision larger nodes with more capacity to meet the resource demands.
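A minimal sketch of explicit requests and limits follows. The Pod name, image, and values are illustrative assumptions; set requests from observed usage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sized-app   # hypothetical Pod name
spec:
  containers:
    - name: app
      image: public.ecr.aws/nginx/nginx:latest  # placeholder image
      resources:
        # Requests drive scheduling and Karpenter's node sizing;
        # set them to reflect actual usage
        requests:
          cpu: 500m
          memory: 512Mi
        # Limits cap consumption to avoid OOM terminations
        limits:
          cpu: "1"
          memory: 1Gi
```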
Taints prevent Pods from being scheduled
When cluster administrators apply custom taints to specific nodes, the Pods must have matching tolerations. If they don't have matching tolerations, then the system can't schedule Pods on those nodes. You receive the following error message:
"0/5 nodes are available: 3 node(s) had taint {dedicated: gpu}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: non-gpu}, that the pod didn't tolerate."
To resolve this error, add the appropriate tolerations to the Pod specification to allow it to tolerate the taints. Or, you can remove or modify the unnecessary custom taints on the nodes if they're too restrictive.
To remove a taint from a node, run the following command:
kubectl taint nodes your-node-name your-custom-taint-
Note: Replace your-node-name with the name of your node, and your-custom-taint with the name of your custom taint.
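A toleration that matches the dedicated: gpu taint from the example error message might look like the following sketch. The Pod name and image are assumptions, and the NoSchedule effect is assumed because the error message doesn't state the taint's effect:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app   # hypothetical Pod name
spec:
  # Tolerate the custom taint so the Pod can schedule on tainted nodes
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule   # assumed effect; match your node's taint
  containers:
    - name: app
      image: public.ecr.aws/nginx/nginx:latest  # placeholder image
```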
NodeAffinity or NodeSelector constraints aren't satisfied
If there are node affinity or node selector constraints that don't match any available nodes in the cluster, then the scheduler can't place Pods. You receive the following error message:
"Warning FailedScheduling 3m default-scheduler 0/4 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't satisfy existing pods anti-affinity rules, 4 node(s) didn't match Pod's node affinity rules."
To resolve this error, modify the Pod's node affinity or node selector requirements to be more flexible. Or, you can provision additional nodes that meet the Pod's criteria if necessary. For more information, see Node affinity and nodeSelector on the Kubernetes website.
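One way to make node affinity more flexible is to switch from required to preferred rules, as in the following sketch. The label key and value are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flexible-app   # hypothetical Pod name
spec:
  affinity:
    nodeAffinity:
      # "preferred" lets the scheduler fall back to other nodes
      # instead of leaving the Pod unscheduled
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: node-type          # hypothetical label key
                operator: In
                values:
                  - compute-optimized   # hypothetical label value
  containers:
    - name: app
      image: public.ecr.aws/nginx/nginx:latest  # placeholder image
```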
Insufficient IP addresses in subnet
When Karpenter tries to provision new nodes, it fails because of insufficient IP addresses in the subnet. This scenario occurs when the Classless Inter-Domain Routing (CIDR) range of the subnet is exhausted and can't accommodate new Amazon Elastic Compute Cloud (Amazon EC2) instances. You receive the following error message:
"error": "creating machine, creating instance, with fleet error(s), InsufficientFreeAddressesInSubnet: There are not enough free addresses in subnet 'subnet-a to satisfy the requested number of instances."
To resolve this error, take either of the following actions:
If the subnet's IP addresses are depleted, then add an additional IPv4 CIDR block as a secondary CIDR to your Amazon Virtual Private Cloud (Amazon VPC).
-or-
Use custom networking to assign separate IP address spaces to your Pods and nodes. To activate custom networking, run the following command:
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
For more information on custom networking, see How do I choose specific IP address subnets for Pods in my Amazon EKS cluster?
Can't schedule Pod because of incompatible requirements
Karpenter fails to schedule Pods that specify node group labels such as eks.amazonaws.com/nodegroup that don't match any values defined in the node pool's configurations. If this mismatch occurs, then Karpenter can't place Pods on nodes because of the absence of required node labels. You receive one of the following error messages:
"incompatible requirements, label \"eks.amazonaws.com/nodegroup\" does not have known values""
"incompatible requirements, key topology.kubernetes.io/zone, topology.kubernetes.io/zone In [us-east-1a] not in topology.kubernetes.io/zone In [us-east-1b us-east-1c]"
"incompatible requirements, key nodes.ktp.io/role, nodes.ktp.io/role In [ktp-atom-apps] not in nodes.ktp.io/role In [kube-apps]"
If you want Pods to be schedulable by Karpenter, then remove the managed node group-specific nodeSelector to resolve this error.
Example:
kubectl edit pod your-pod-name
-or-
kubectl edit deployment your-deployment-name
-or-
kubectl edit daemonset your-daemonset-name
Note: Replace your-pod-name, your-deployment-name, or your-daemonset-name with the name of your Pod, deployment, or daemonset.
Node consolidation failures
Karpenter's node consolidation can fail because of scheduling constraints or specific Pod configurations that prevent Pod migration.
Scheduling constraints
Node consolidation fails when Pods can't be migrated because of the following reasons:
- Inter-Pod affinity or anti-affinity: Pods that require or avoid co-locating with other Pods.
- Topology spread constraints: Pods that must be distributed across different zones, nodes, or racks.
- Other scheduling restrictions: Any other constraints that prevent Pods from moving to other nodes.
Review and adjust the Pod affinity and anti-affinity rules to be less restrictive, relax topology spread constraints to allow more flexibility, and loosen any other scheduling restrictions that are too tight.
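For example, relaxing a topology spread constraint from DoNotSchedule to ScheduleAnyway lets the scheduler place migrated Pods even when a perfect spread isn't possible during consolidation. The label is an illustrative assumption:

```yaml
topologySpreadConstraints:
  - maxSkew: 2                         # allow more imbalance than the default of 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway  # best-effort spread instead of a hard rule
    labelSelector:
      matchLabels:
        app: your-app                  # replace with your app label
```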
Pod-specific prevention
If there are certain types of Pods that run on your nodes, then Karpenter might not be able to consolidate the nodes. Karpenter can't evict these Pods because of annotations, scheduling constraints, Pod Disruption Budgets (PDBs), or the lack of a controller owner. Consolidation might fail because Karpenter won't violate these preferences, even if kube-scheduler could technically fit the Pods elsewhere.
AWS OFFICIAL. Updated 7 months ago.