Pods in Fargate stuck PENDING when using pod security groups


I have an application running successfully in my EKS cluster with pods scheduled on Fargate, but I'm unable to connect to an RDS database from my pods. I'd like to use Security Groups for Pods to grant my pods access to RDS, so I've created a new security group that allows all egress, and added it as an allowed ingress source on port 3306 in my RDS cluster's security group.

I created a SecurityGroupPolicy to assign my new security group and the cluster security group to pods in my application. When that policy is in place and I cycle my pods, they get stuck PENDING showing these events:

Warning  FailedScheduling  27s    fargate-scheduler  Pod provisioning timed out (will retry) for pod: intake-tool-dev/intake-tool-deployment-76b8bb5698-nrqhs
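
For anyone retracing this, the stuck pod's events and annotations can be inspected with kubectl. This is just a sketch; the namespace and pod name are taken from the event message above:

```shell
# Describe the stuck pod to see its full event history (names from the event above).
kubectl -n intake-tool-dev describe pod intake-tool-deployment-76b8bb5698-nrqhs

# Dump all annotations on the pod; the security groups requested via the
# SecurityGroupPolicy show up among them.
kubectl -n intake-tool-dev get pod intake-tool-deployment-76b8bb5698-nrqhs \
  -o jsonpath='{.metadata.annotations}'
```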

I see in the Security Group for Pods document there are a few notes on what my security groups need:

They must exist: I have verified that both the security groups shown in the pod-sg annotation on my pods exist.

They must allow inbound communication from the security group applied to your nodes (for kubelet) over any ports that you've configured probes for: I don't believe I have configured any probes, but maybe I'm missing something here and I need to be allowing some default set of ports inbound to my pods from the kubelet?

They must allow outbound communication over TCP and UDP ports 53 to a security group assigned to the Pods (or nodes that the Pods run on) running CoreDNS. The security group for your CoreDNS Pods must allow inbound TCP and UDP port 53 traffic from the security group that you specify: My security group allows all egress traffic and I do not have a security group applied to my CoreDNS pods, so I think this is good.

They must have necessary inbound and outbound rules to communicate with other Pods that they need to communicate with. I don't need these pods communicating with any other pods, so I don't think I'm missing anything here.

They must have rules that allow the Pods to communicate with the Kubernetes control plane if you're using the security group with Fargate. The easiest way to do this is to specify the cluster security group as one of the security groups. I have set up my SecurityGroupPolicy to apply both my security group and the cluster security group. (Specifically the module.eks-cluster.cluster_security_group_id security group output from the Terraform EKS module I'm using.)
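
For reference, a sketch of verifying those rules with the AWS CLI; the security-group IDs below are placeholders, not values from this cluster:

```shell
# Placeholder IDs: substitute your new pod security group and the
# module.eks-cluster.cluster_security_group_id Terraform output.
POD_SG=sg-0123456789abcdef0
CLUSTER_SG=sg-0fedcba9876543210

# Confirm both groups exist and inspect their ingress/egress rules in one call.
aws ec2 describe-security-groups --group-ids "$POD_SG" "$CLUSTER_SG" \
  --query 'SecurityGroups[].{Id:GroupId,Ingress:IpPermissions,Egress:IpPermissionsEgress}'
```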

So I think I have everything in order, though I suspect I may not be understanding what I need to be allowing in from the kubelet.

There's one paragraph a bit lower, but I don't know how this would apply to pods running in Fargate since I don't get to control what instance types my pods are running on:

If any Pods are stuck in the Pending state, confirm that your node instance type is listed in limits.go and that the product of the maximum number of branch network interfaces supported by the instance type multiplied times the number of nodes in your node group hasn't already been met. For example, an m5.large instance supports nine branch network interfaces. If your node group has five nodes, then a maximum of 45 branch network interfaces can be created for the node group. The 46th Pod that you attempt to deploy will sit in Pending state until another Pod that has associated security groups is deleted.
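
The arithmetic in that paragraph, made concrete (this limit applies to EC2 node groups; it is shown here only to illustrate the quoted example):

```shell
# Branch-ENI math from the quoted paragraph: an m5.large supports 9 branch
# network interfaces, so a 5-node group caps out at 45 security-group pods.
branch_enis_per_node=9
nodes=5
echo $((branch_enis_per_node * nodes))   # prints 45
```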

I'm pretty sure the problem here is me denying the control plane something it needs to determine pod health, but I'm not sure what that might be. Any thoughts on next things to check?

1 Answer

It seems like you've done a thorough job in setting up the security groups and policies, but there are a few potential areas to investigate further.

  1. Probes and Health Checks:

    • Probes are used by the kubelet to determine the health of your pods. If you haven't defined any liveness, readiness, or startup probes, the kubelet generates no probe traffic, so no extra inbound probe ports are required.
    • Still, check your pod specifications (including anything added by a Helm chart or operator) for probes you may not have written yourself; any ports they use must be allowed inbound from the security group applied to your nodes.
  2. CoreDNS Communication:

    • Although you mentioned that your security group allows all egress traffic, ensure that the security groups for your pods are indeed allowing outbound communication over TCP and UDP ports 53 to a security group assigned to the Pods (or nodes that the Pods run on) running CoreDNS.
    • Confirm that your CoreDNS security group allows inbound TCP and UDP port 53 traffic from the security group specified.
  3. Control Plane Communication:

    • In Fargate, you might not directly control the node instances, but you still need to ensure that your security groups allow the pods to communicate with the Kubernetes control plane.
    • Make sure that the security group for your pods allows the necessary communication with the Kubernetes control plane. The easiest way is to include the cluster security group in your SecurityGroupPolicy, which you've already done; double-check that the ID in the policy matches the actual cluster security group.
  4. Instance Type Limitations:

    • Fargate doesn't expose instance types; each pod runs on dedicated, right-sized capacity. The branch-network-interface limit quoted from limits.go applies to EC2 node groups, so it shouldn't be the constraint here.
  5. Logs and Events:

    • Check the logs and events associated with your pods, nodes, and the Fargate infrastructure. They might provide more specific information about why the pods are stuck in the Pending state.
  6. Network Interfaces:

    • On Fargate, each pod gets its own dedicated network interface, so per-node branch-interface limits don't apply. If you suspect a networking capacity issue, check for free IP addresses in the subnets your Fargate profile uses.
  7. AWS Support:

    • If the issue persists, consider reaching out to AWS support. They can provide more in-depth analysis and assistance, especially if there are specific platform-related issues or limitations that need attention.

By systematically going through these points, you should be able to identify and resolve the issue causing your pods to be stuck in the Pending state.
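
A minimal sketch of checks 1 and 5 with kubectl; the namespace and deployment names are assumptions based on the question:

```shell
# Check 1: look for probe definitions whose ports the pod security group
# would need to allow inbound (names assumed from the question).
kubectl -n intake-tool-dev get deployment intake-tool-deployment -o yaml \
  | grep -E -A4 'livenessProbe|readinessProbe|startupProbe'

# Check 5: review recent events in the namespace for more detail on the
# provisioning timeout.
kubectl -n intake-tool-dev get events --sort-by=.lastTimestamp
```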

Answered by Renato (AWS), 4 months ago
