Problem with Application Load Balancer rules: health check only passes for the default rule

Hi everyone. I have three microservices running on an ECS cluster. Each microservice is launched as a Fargate task and runs in its own Docker container.

  • Microservice A responds on port 8083.
  • Microservice B responds on port 8084.
  • Microservice C responds on port 8085.

My configuration consists of two public subnets, two private subnets, an internet gateway and a NAT gateway, as well as two security groups: one for the Fargate services and one for the ALB. On both security groups I have enabled inbound traffic on all ports.

I have defined a listener for the ALB on port 80 and written some path-based rules to route requests to the appropriate target group (every target group uses the IP target type):

[Screenshot: ALB listener rules]

Only the health check of the target group behind the default rule passes (though I suspect this happens at random), and consequently only the service reachable on port 8083 works.

[Screenshot: target group health check status]

The remaining target groups are unreachable. In the "Registered Target" section of those groups, the assigned IP addresses change continuously. For example:

[Two screenshots: registered targets, showing different IP addresses]

But every newly assigned IP fails its health check with a timeout. Only occasionally, and seemingly at random, is an IP address registered correctly.

These are the ECS configurations of one of the unresponsive services:

[Screenshot: ECS service configuration]

What is the problem and how can I solve it? Thank you.

UPDATE1

I tried adding a new instance of microservice A. For the new IP (10.0.0.137) the health check does not respond. After a few minutes, a new IP (10.0.0.151) is provisioned and it is registered correctly:

[Screenshot: registered targets for microservice A]

UPDATE2

This is really strange behavior. All services are now connected correctly, after several hours of failed attempts. It looks like an IP address assignment problem: AWS makes several attempts with different IP addresses until it randomly hits one that works. These are the CIDRs of my subnets:

  • private_subnets = ["10.0.0.128/28", "10.0.0.144/28"]
  • public_subnets = ["10.0.0.0/28", "10.0.0.16/28"]

While these are the IPs that connected successfully:

  1. 10.0.0.136 (Microservice A, instance 1)
  2. 10.0.0.151 (Microservice A, instance 2)
  3. 10.0.0.153 (Microservice A, instance 3)
  4. 10.0.0.152 (Microservice B)
  5. 10.0.0.142 (Microservice C)
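
One way to test the subnet theory is to map every task IP, working or not, to the subnet that contains it. The sketch below uses only Python's standard `ipaddress` module; the CIDRs and IPs are copied from this post, while the subnet labels and the helper name `subnet_of` are made up for illustration.

```python
import ipaddress
from typing import Optional

# CIDRs from the post; the labels on the left are illustrative only.
SUBNETS = {
    "private-1": ipaddress.ip_network("10.0.0.128/28"),
    "private-2": ipaddress.ip_network("10.0.0.144/28"),
    "public-1": ipaddress.ip_network("10.0.0.0/28"),
    "public-2": ipaddress.ip_network("10.0.0.16/28"),
}


def subnet_of(ip: str) -> Optional[str]:
    """Return the label of the subnet containing ip, or None if there is none."""
    addr = ipaddress.ip_address(ip)
    for label, network in SUBNETS.items():
        if addr in network:
            return label
    return None


# IPs that registered correctly, plus 10.0.0.137, which failed (see UPDATE1).
TASK_IPS = [
    "10.0.0.136", "10.0.0.151", "10.0.0.153",
    "10.0.0.152", "10.0.0.142", "10.0.0.137",
]

for ip in TASK_IPS:
    print(ip, "->", subnet_of(ip))
```

If the failing and the working addresses land in the same private subnets, the subnet layout alone is unlikely to explain the difference.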
3 Answers

My first guess would be the security group on the ECS services. Make sure it allows inbound traffic on ports 8084 and 8085, as well as 8083, from the ALB security group.

"Time out" is often caused by security groups.

Hope it helps!

//Carl

answered 2 years ago
  • Hi Carl, and thanks for the reply. I have enabled all incoming traffic on the security groups, but the result does not change. Also, if I add a new instance to the target group whose health check responds, the newly added IP is unreachable.

OK, then the "easy and obvious" solution did not do the trick, so you need to verify everything again 😩

The AWS definition of this error is:

HTTP 408: Request timeout: The client did not send data before the idle timeout period expired. Sending a TCP keep-alive does not prevent this timeout. Send at least 1 byte of data before each idle timeout period elapses. Increase the length of the idle timeout period as needed.

So let's start from the top:

  1. Are the application and the container configured to use port 8084? (test this locally)
  2. Is your task definition actually using the correct container image?
  3. Increase the timeout on the health check.
  4. Increase the CPU and RAM on the TASK to make it start quicker.
  5. Is there any reason why the request could take too long? (reconfigure the endpoint to give a quick mock response).
  6. Does the App do anything that could "freeze" on boot?
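
Items 1 and 5 can be checked with a small timing probe run against the container locally or from a host in the same VPC. This is a minimal sketch using only the Python standard library; the function name `probe` and the `/health` path in the usage note are assumptions, not something taken from the question.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0):
    """Request url once and return (status, elapsed_seconds).

    status is the HTTP status code, or None if no HTTP response arrived
    (timeout, connection refused, DNS failure, ...).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        status = err.code
    except OSError:
        # Covers socket timeouts and urllib.error.URLError (connection errors).
        status = None
    return status, time.monotonic() - start
```

For example, `probe("http://localhost:8084/health", timeout_s=5.0)` returns the status code and how long the endpoint took to answer. If the elapsed time is close to or above the timeout configured on the target group health check, the health check will fail even though the service itself works. Note that the `timeout` passed to `urlopen` applies to each blocking socket operation, so treat the measurement as approximate.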

There are probably a bunch of other things you need to check, but this is what I came up with off the top of my head.

Sorry that I can't be more specific!

Good luck and please tell me what it was when you figure it out 😊

answered 2 years ago
  • Thanks for your suggestions. I suspect that it may be a problem with subnets, as I wrote in the UPDATE2 of the post

I have found the cause of this strange behavior: the timeout I had set for the target group health check was too low. Increasing this value solved the problem.
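
For anyone who wants to script the change, here is a rough sketch of raising the timeout with boto3. The helper names, the placeholder ARN and the values 10 and 30 are hypothetical (the actual values used are not stated in this thread). The check that the timeout is shorter than the interval reflects my understanding of the ALB constraint and should be verified against the documentation.

```python
def health_check_settings(target_group_arn: str, timeout_s: int, interval_s: int) -> dict:
    """Build keyword arguments for the elbv2 modify_target_group call."""
    # Assumption: the health check timeout has to be shorter than the interval.
    if timeout_s >= interval_s:
        raise ValueError("health check timeout must be shorter than the interval")
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckTimeoutSeconds": timeout_s,
        "HealthCheckIntervalSeconds": interval_s,
    }


def apply_settings(settings: dict) -> None:
    """Send the settings to AWS; needs boto3 and valid credentials."""
    import boto3  # third-party; imported here so the helper above runs without it

    boto3.client("elbv2").modify_target_group(**settings)


# Hypothetical values and ARN, for illustration only.
settings = health_check_settings("arn:aws:elasticloadbalancing:...", timeout_s=10, interval_s=30)
print(settings)
```

Calling `apply_settings(settings)` with a real target group ARN would apply the change; the same fields can also be edited in the console under the target group's health check settings.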

zar1978
answered 2 years ago
