PodEvictionFailure: Reached max retries while trying to evict pods from nodes in node group

Hi All - I encountered errors while executing Terraform. Would you kindly offer any suggestions on resolving them? Here are the error details:

╷
│ Error: waiting for EKS Node Group (ecp-ppp-stage:initial-2024030122380284920000002e) version update (132984bf-4bca-39e4-b851-5adec5a6f9f3): unexpected state 'Failed', wanted target 'Successful'. last error: ip-10-20-23-68.ec2.internal: PodEvictionFailure: Reached max retries while trying to evict pods from nodes in node group initial-2024030122380284920000002e
│ 
│   with module.eks.module.eks_managed_node_group["initial"].aws_eks_node_group.this[0],
│   on .terraform/modules/eks/modules/eks-managed-node-group/main.tf line 338, in resource "aws_eks_node_group" "this":
│  338: resource "aws_eks_node_group" "this" {
│ 
╵
╷
│ Error: creating Secrets Manager Secret (argocd): operation error Secrets Manager: CreateSecret, https response error StatusCode: 400, RequestID: d1d5b5b9-145c-460b-90bc-8a5b0150c08f, InvalidRequestException: You can't create this secret because a secret with this name is already scheduled for deletion.
│ 
│   with aws_secretsmanager_secret.argocd,
│   on main.tf line 217, in resource "aws_secretsmanager_secret" "argocd":
│  217: resource "aws_secretsmanager_secret" "argocd" {
│ 
╵
╷
│ Error: Unable to continue with install: ServiceAccount "argocd-application-controller" in namespace "argocd" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "argo-cd"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "argocd"
│ 
│   with module.gitops_bridge_bootstrap.helm_release.argocd[0],
│   on .terraform/modules/gitops_bridge_bootstrap/main.tf line 4, in resource "helm_release" "argocd":
│    4: resource "helm_release" "argocd" {
│ 
╵

Here is the EKS Terraform module configuration I am using:

################################################################################
# EKS Cluster
################################################################################

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name                   = local.name
  cluster_version                = local.cluster_version

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  enable_irsa = true

  # # Give the Terraform identity admin access to the cluster
  # # which will allow resources to be deployed into the cluster
  enable_cluster_creator_admin_permissions = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # EKS Managed Node Group(s)
  eks_managed_node_groups = {
    initial = {
      instance_types = ["m5.large"]

      min_size     = 1
      max_size     = 3
      desired_size = 2
    }
  }

  # EKS Addons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    }
    vpc-cni = {
      # Specify the VPC CNI addon should be deployed before compute to ensure
      # the addon is configured before data plane compute resources are created
      # See README for further details
      before_compute = true
      most_recent    = true # To ensure access to the latest settings provided
      configuration_values = jsonencode({
        env = {
          # Reference docs https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  }

  tags = local.tags
}

module "ebs_csi_driver_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.20"

  role_name_prefix = "${module.eks.cluster_name}-ebs-csi-"

  attach_ebs_csi_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:ebs-csi-controller-sa"]
    }
  }
}

module "vpc_cni_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.20"

  role_name_prefix = "${module.eks.cluster_name}-vpc-cni-"

  attach_vpc_cni_policy = true
  vpc_cni_enable_ipv4   = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-node"]
    }
  }
}

1 Answer

Please work through the checks below for each of the errors you listed:

PodEvictionFailure: This error indicates that Kubernetes failed to evict pods from the nodes being replaced in the node group, typically because a Pod Disruption Budget (PDB) blocks the eviction or because the remaining nodes lack the capacity to reschedule the displaced pods. Here are some steps to troubleshoot this issue:

- Check the resource requests and limits of the pods in the node group, and confirm the other nodes have room to take them.
- Check whether any Pod Disruption Budgets (PDBs) are defined for the affected pods or namespaces, and relax them if they make eviction impossible (for example, a PDB that allows zero disruptions on a single-replica deployment).
- Manually drain the node to evict pods gracefully before retrying the node group update. Example commands are sketched after the next paragraph.

Secrets Manager Error: This error occurs when you try to create a secret whose name is still scheduled for deletion. Secrets Manager keeps a deleted secret for its recovery window (7 to 30 days) before releasing the name. To resolve this, either wait for the window to expire, restore the pending secret, force-delete it immediately, or use a different secret name.
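As a minimal sketch of those checks, assuming kubectl and the AWS CLI are configured for this cluster and account, and using the node name and the secret name (argocd) taken from the errors above:

# List Pod Disruption Budgets in all namespaces to see which ones could block eviction
kubectl get pdb --all-namespaces

# Gracefully drain the node named in the error before retrying the node group update
kubectl drain ip-10-20-23-68.ec2.internal --ignore-daemonsets --delete-emptydir-data

# Inspect the secret that is scheduled for deletion
aws secretsmanager describe-secret --secret-id argocd

# Option 1: restore it so the existing secret can be reused or imported
aws secretsmanager restore-secret --secret-id argocd

# Option 2: delete it immediately so the name can be recreated by Terraform
aws secretsmanager delete-secret --secret-id argocd --force-delete-without-recovery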

Invalid Ownership Metadata for ServiceAccount: The ServiceAccount argocd-application-controller already exists in the argocd namespace but was not created by this Helm release, so Helm refuses to adopt it into the argo-cd release. To resolve this, either delete the pre-existing ServiceAccount (and any other leftover Argo CD resources) so the helm_release can recreate it, or add the ownership label and annotations Helm expects so the existing object can be adopted.
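A sketch of the adoption approach, using the exact label and annotation values the error message asks for; any other Argo CD resources created outside Helm would need the same treatment:

# Mark the existing ServiceAccount as managed by Helm
kubectl label serviceaccount argocd-application-controller -n argocd app.kubernetes.io/managed-by=Helm --overwrite

# Point it at the Helm release that should own it
kubectl annotate serviceaccount argocd-application-controller -n argocd meta.helm.sh/release-name=argo-cd meta.helm.sh/release-namespace=argocd --overwrite

# Alternatively, delete it and let the Helm release recreate it
# kubectl delete serviceaccount argocd-application-controller -n argocd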

To address these errors, you may need to adjust your Terraform configuration, patch or relax the affected Kubernetes resources, or intervene manually to clear the conflicts before re-running terraform apply.

I hope this helps.

answered 10 days ago
