ALB health checks fail only when using an ECS Capacity Provider

Hi,

I have a CDK stack that creates an application load balanced ECS service on a cluster in my VPC. When I create the cluster with hard-coded capacity instances, the services deploy and reach steady state and the application works as expected. When I remove the hard-coded instances and instead provide an EC2 capacity provider and auto scaling group, I cannot deploy the application. The ECS service is created and tasks are placed, and the tasks even show a "Healthy" state in the ECS dashboard. The application logs show successful healthchecks (i.e. 200 statuses are being returned periodically) but after a few minutes, ECS kills the healthy task. ECS indicates the task was killed, with an event listed: "service Foo instance Foo port 32768 is unhealthy in target-group Foo due to (reason Request timed out)". When I examine the killed task after shutdown, the console reports "Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:us-west-2:Foo:Foo/Foo)".

I suspect that switching to the instances provided by the capacity provider has complicated the networking situation, resulting in the ELB being unable to reach the task. The requests being logged are likely just container health checks. What I can't figure out is how to repair the load balancer's path to the containers - when I create the autoscaling group, I'm passing in the same VPC as I'm providing to the cluster (and that cluster is passed to the ApplicationLoadBalancedEc2Service CDK construct).

Here is a working code example with hard-coded capacity instances:

const vpc = new Vpc(this, `vpc-${props.environmentName}`, {});

const clusterId = `cluster-${props.environmentName}`;
const cluster = new ECS.Cluster(this, `my-cluster`, {
    clusterName: clusterId,
    vpc,
    capacity: {
        instanceType: new EC2.InstanceType(props.clusterInstanceType),
    },
    containerInsights: true,
});

const serverTaskId = `server-task-${props.environmentName}`;
const serverTaskDefinition = new ECS.TaskDefinition(this, serverTaskId, {
    compatibility: ECS.Compatibility.EC2,
});

const serverContainer = serverTaskDefinition.addContainer('ServerContainer', {
    image: ContainerImage.fromEcrRepository(serverRepo, latestTag),
    containerName: 'my-server',
    memoryReservationMiB: 1024,
    portMappings: [
        {
            containerPort: Port.HTTP,
        },
    ],
    healthCheck: {
        command: [`curl localhost/healthcheck`],
        interval: cdk.Duration.seconds(120),
    },
});

const loadBalancedAPIService = new ECSPatterns.ApplicationLoadBalancedEc2Service(this, `server-service`, {
    cluster,
    taskDefinition: serverTaskDefinition,
    cpu: 256,
    memoryReservationMiB: 512,
    desiredCount: 2,
    protocol: LoadBalancingV2.ApplicationProtocol.HTTPS,
    openListener: true,
    domainZone: hostedZone,
    certificate: cert,
    publicLoadBalancer: true,
    maxHealthyPercent: 200,
    minHealthyPercent: 100,
});
loadBalancedAPIService.targetGroup.configureHealthCheck({
    path: "/healthcheck",
    port: 'traffic-port',
    protocol: LoadBalancingV2.Protocol.HTTP,
    interval: cdk.Duration.seconds(120),
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 2,
});

Here is the broken code example with the EC2 capacity provider and autoscaling group:

const vpc = new Vpc(this, `vpc-${props.environmentName}`, {});
const clusterId = `cluster-${props.environmentName}`;
const cluster = new ECS.Cluster(this, `my-cluster`, {
    clusterName: clusterId,
    vpc,
    containerInsights: true,
});

const autoScalingGroup = new AutoScaling.AutoScalingGroup(this, `cluster-asg`, {
    vpc,
    instanceType: new EC2.InstanceType(props.clusterInstanceType),
    machineImage: ECS.EcsOptimizedImage.amazonLinux2(),
    maxCapacity: props.maxClusterInstanceCount,
    minCapacity: 1,
    healthCheck: AutoScaling.HealthCheck.ec2({
        grace: cdk.Duration.seconds(240),
    }),
});

const capacityProvider = new ECS.AsgCapacityProvider(this, `cluster-capacity-provider`, {
    autoScalingGroup,
    maximumScalingStepSize: 2,
    minimumScalingStepSize: 1,
    canContainersAccessInstanceRole: true,
});

const capacityProviderStrategy: ECS.CapacityProviderStrategy = {
    capacityProvider: capacityProvider.capacityProviderName,
    weight: 1,
    base: 1,
};

cluster.addAsgCapacityProvider(capacityProvider, {
    canContainersAccessInstanceRole: true,
    machineImageType: ECS.MachineImageType.AMAZON_LINUX_2,
});

cluster.addDefaultCapacityProviderStrategy([
    capacityProviderStrategy,
]);

const serverTaskId = `server-task-${props.environmentName}`;
const serverTaskDefinition = new ECS.TaskDefinition(this, serverTaskId, {
    compatibility: ECS.Compatibility.EC2,
});

const loadBalancedAPIService = new ECSPatterns.ApplicationLoadBalancedEc2Service(this, `server-service`, {
    cluster,
    healthCheckGracePeriod: cdk.Duration.minutes(3),
    taskDefinition: serverTaskDefinition,
    cpu: 256,
    memoryReservationMiB: 512,
    desiredCount: 1,
    protocol: LoadBalancingV2.ApplicationProtocol.HTTPS,
    openListener: true,
    domainZone: hostedZone,
    certificate: cert,
    publicLoadBalancer: true,
    maxHealthyPercent: 200,
    minHealthyPercent: 100,
    capacityProviderStrategies: [
        {
            capacityProvider: capacityProvider.capacityProviderName,
            weight: 1,
            base: 1,
        },
    ],
});
loadBalancedAPIService.targetGroup.configureHealthCheck({
    path: "/healthcheck",
    port: 'traffic-port',
    protocol: LoadBalancingV2.Protocol.HTTP,
    interval: cdk.Duration.seconds(120),
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 2,
});

I have tried adding explicit security groups to the load balancer and auto scaling group in the broken code above by adding:

const loadBalancerSG = new EC2.SecurityGroup(this, `loadbalancer-egress`, {
    vpc,
    allowAllIpv6Outbound: true,
});
loadBalancerSG.addIngressRule(EC2.Peer.ipv4('0.0.0.0/0'), EC2.Port.tcp(Port.HTTPS));

const allowInstanceTrafficFromAlbSecurityGroup = new EC2.SecurityGroup(this, `asg-ingress`, {
    vpc,
    allowAllOutbound: true,
});
allowInstanceTrafficFromAlbSecurityGroup.addIngressRule(loadBalancerSG, EC2.Port.tcp(Port.HTTP), "allow HTTP traffic from load balancer");
allowInstanceTrafficFromAlbSecurityGroup.addIngressRule(loadBalancerSG, EC2.Port.tcp(Port.HTTPS), "allow HTTPS traffic from load balancer");
autoScalingGroup.addSecurityGroup(allowInstanceTrafficFromAlbSecurityGroup);

I have tried the troubleshooting steps in this document (https://repost.aws/knowledge-center/troubleshoot-unhealthy-checks-ecs) which led me to add the security group definitions above, and now I'm seeing health check requests in the service logs BOTH from localhost and an IP in the 10.x.x.x range, which I assume to be the load balancer. Dozens of health check requests are succeeding, but still, the task is terminated with the same "Task failed ELB health checks" message.

Any advice would be appreciated, I am a bit out of my depth here and am not sure how the capacity provider is affecting the ALB networking. Thanks in advance!

1 Answer

Are the ELB healthchecks still failing with a timeout as the reason after adding the SGs?

If you're seeing more healthcheck requests come in, then I would guess the SG changes allowed traffic to start flowing, but there are two things to confirm:

  1. You can check the ELB ENI (Network Interfaces section of the EC2 console) to confirm the ELB currently has the same IP as you're seeing in the logs
  2. Try using the Reachability Analyzer to confirm everything is configured correctly

The ELB SG needs to allow outbound traffic to the EC2 instances, and the instances need to allow inbound traffic from the ELB. I'm not confident enough in my CDK to know for sure that both rules are being added, so I would check the security group settings manually after they're created.
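
As a rough illustration of why both rules matter, here is a small TypeScript model of that two-sided check. It is only a sketch: `Rule`, `SecurityGroupModel`, `covers`, `healthCheckCanReach` and the group IDs are hypothetical names made up for this example, not part of the AWS SDK or CDK, and real security groups have more rule types than this model covers.

```typescript
// Hypothetical, simplified model of a security group rule (not the AWS or CDK API).
interface Rule {
    peerGroupId: string; // security group on the other side of the connection
    fromPort: number;    // inclusive start of the TCP port range
    toPort: number;      // inclusive end of the TCP port range
}

interface SecurityGroupModel {
    id: string;
    ingress: Rule[];
    egress: Rule[];
}

// True when at least one rule names the peer group and covers the port.
function covers(rules: Rule[], peerGroupId: string, port: number): boolean {
    return rules.some(
        (r) => r.peerGroupId === peerGroupId && port >= r.fromPort && port <= r.toPort,
    );
}

// A health check gets through only if the load balancer group allows it out
// AND the instance group allows it in.
function healthCheckCanReach(
    alb: SecurityGroupModel,
    instance: SecurityGroupModel,
    port: number,
): boolean {
    return covers(alb.egress, instance.id, port) && covers(instance.ingress, alb.id, port);
}

// Made-up groups: the load balancer may send to the instances on port 80,
// but the instances do not yet accept anything from the load balancer.
const albGroup: SecurityGroupModel = {
    id: "sg-alb",
    ingress: [],
    egress: [{ peerGroupId: "sg-instances", fromPort: 80, toPort: 80 }],
};
const instanceGroup: SecurityGroupModel = { id: "sg-instances", ingress: [], egress: [] };

const beforeIngressRule = healthCheckCanReach(albGroup, instanceGroup, 80);

// Add the matching inbound rule on the instance side.
instanceGroup.ingress.push({ peerGroupId: "sg-alb", fromPort: 80, toPort: 80 });
const afterIngressRule = healthCheckCanReach(albGroup, instanceGroup, 80);

console.log(beforeIngressRule, afterIngressRule);
```

Note that the inbound rule only helps if it names the security group that is actually attached to the load balancer, which is one of the things worth confirming in the console.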

The capacity provider is linked to an ASG, which is spinning up EC2 instances. Nothing about the ASG or the capacity provider themselves inherently causes healthcheck or connectivity issues, so it's likely some difference in the instances being launched. You said the subnets are the same for both working and non-working instances, so that automatically rules out a lot of things (network ACLs and route tables, for both the ELB nodes and the backend instances).

AWS
answered 20 days ago
EXPERT
reviewed 18 days ago
  • Hi Shahad, thank you for your answer - I was able to use the EC2 network interfaces console to confirm the healthcheck requests are coming from network interfaces associated with the load balancer. The Reachability Analyzer shows the instance created by the capacity provider is reachable from the ALB network interface.

    The task events still show "instance is unhealthy due to (reason Request timed out)". I've tried extending the grace period for the ELB health check, and the task stays alive as long as the grace period is in effect, but then dies once the grace period ends.

  • Are the tasks taking longer to reply than the target group's healthcheck timeout? Otherwise, it seems strange that the instance would be reachable but still get timeouts, unless the tasks are being registered on dynamic ports that aren't included in the security groups. You may want to open a technical support case so someone can look at the setup of your specific resources.
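
    To make the dynamic-port possibility concrete, the sketch below checks whether host port 32768 (the port in the ECS service event quoted in the question) is covered by ingress rules that allow only HTTP and HTTPS. It is a plain TypeScript model, not AWS or CDK code: `IngressRule` and `isPortAllowed` are hypothetical names, the rules assume `Port.HTTP` and `Port.HTTPS` in the question mean 80 and 443, and the 32768-65535 range is an assumed range for dynamically mapped host ports that should be verified for the instances in use.

```typescript
// Hypothetical model of a security group ingress rule: an inclusive TCP port range.
interface IngressRule {
    fromPort: number;
    toPort: number;
    description: string;
}

// Returns true when at least one rule covers the given port.
function isPortAllowed(rules: IngressRule[], port: number): boolean {
    return rules.some((rule) => port >= rule.fromPort && port <= rule.toPort);
}

// Rules modelled on the question's instance security group
// (assumes Port.HTTP is 80 and Port.HTTPS is 443).
const instanceRules: IngressRule[] = [
    { fromPort: 80, toPort: 80, description: "allow HTTP traffic from load balancer" },
    { fromPort: 443, toPort: 443, description: "allow HTTPS traffic from load balancer" },
];

// Host port reported in the ECS service event.
const dynamicHostPort = 32768;

// Same rules plus an assumed range for dynamically mapped host ports.
const withDynamicRange: IngressRule[] = [
    ...instanceRules,
    { fromPort: 32768, toPort: 65535, description: "allow dynamic host ports from load balancer" },
];

const allowedByOriginalRules = isPortAllowed(instanceRules, dynamicHostPort);
const allowedWithDynamicRange = isPortAllowed(withDynamicRange, dynamicHostPort);

console.log(allowedByOriginalRules, allowedWithDynamicRange);
```

    If the first check comes out false for the rules actually deployed, the health check on the traffic port would be blocked at the instance even though ports 80 and 443 are open.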
