My ECS service is failing after the task starts


I created the ECS task definition for the service like this:

{
    "containerDefinitions": [
        {
            "name": "api",
            "image": "414397229292.dkr.ecr.ap-south-1.amazonaws.com/bids365/backend:79cb8aa",
            "cpu": 256,
            "memoryReservation": 512,
            "portMappings": [
                {
                    "containerPort": 3000,
                    "hostPort": 3000,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "IMPART_ENV",
                    "value": "dev"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "disableNetworking": false,
            "privileged": false,
            "readonlyRootFilesystem": true,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "bids365/backend/dev",
                    "awslogs-region": "ap-south-1",
                    "awslogs-stream-prefix": "dev"
                }
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "curl -f http://localhost:3000/ping || exit 1"
                ],
                "interval": 60,
                "timeout": 2,
                "retries": 3,
                "startPeriod": 60
            }
        }
    ],
    "family": "service-dev",
    "taskRoleArn": "arn:aws:iam::414397229292:role/task-role-backend-dev",
    "executionRoleArn": "arn:aws:iam::414397229292:role/ecs-task-execution-role",
    "networkMode": "awsvpc",
    "volumes": [],
    "placementConstraints": [],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "256",
    "memory": "512"
}

My ALB is created like this using Terraform:

resource "aws_lb" "backend" {
  name               = "backend-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [module.sg.id]
  subnets            = ["subnet-082af865c1410b0ef", "subnet-0da703055394aa446", "subnet-02d9bc7c78c939446"]

  enable_deletion_protection = false // todo - change this when we get more clarity

  tags = {
    Environment = "production"
  }

  lifecycle {
    prevent_destroy = false
  }
}

module "sg" {
  source    = "cloudposse/security-group/aws"
  version   = "0.1.3"
  vpc_id    = data.aws_vpc.bid365-backend.id
  delimiter = ""
  name      = "443-ingress-private-egress"

  rules = [
    {
      type      = "egress"
      from_port = 0
      to_port   = 65535
      protocol  = "TCP"
      # cidr_blocks = [data.aws_vpc.bid365-backend.cidr_block]
      self = true
    },
    {
      type      = "ingress"
      from_port = 443
      to_port   = 443
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self = true
    },
    {
      type      = "ingress"
      from_port = 80
      to_port   = 80
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self = true
    },
    {
      type      = "ingress"
      from_port = 3000
      to_port   = 3000
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self = true
    }
  ]
}

resource "aws_lb_listener" "redirect_non_ssl" {
  load_balancer_arn = aws_lb.backend.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

resource "aws_acm_certificate" "cert" {
  domain_name       = var.app_dns_entry
  validation_method = "DNS"

  lifecycle {
    prevent_destroy = false
  }
}

resource "aws_lb_listener" "app" {
  load_balancer_arn = aws_lb.backend.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = "arn:aws:acm:ap-south-1:414397229292:certificate/4290c5e1-4b49-40bf-afb5-bedeefd072c2"

  default_action {
    type = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "Not Found\n"
      status_code  = "404"
    }
  }

  lifecycle {
    prevent_destroy = false
  }
}
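The task definition runs as a Fargate service registered with the ALB target group, roughly along these lines (a simplified sketch, not my exact file; the cluster, subnet, and target group key are placeholders):

resource "aws_ecs_service" "api" {
  name            = "service-dev"
  cluster         = aws_ecs_cluster.backend.id      # placeholder cluster reference
  task_definition = aws_ecs_task_definition.api.arn # placeholder reference to the task definition above
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = ["subnet-xxxxxxxxxxxxxxxxx"]   # placeholder subnet in the same VPC as the ALB
    security_groups  = [aws_security_group.fargate.id]
    assign_public_ip = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.map["api"].arn # placeholder key into the for_each target groups
    container_name   = "api"
    container_port   = 3000
  }
}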

Everything seems correct, and the task and ALB are both created, but the target group health check is failing. The container port is mapped correctly, and the /ping endpoint I created for the health check also works if I access it from inside the container. Please help, I have been stuck on this for a long time and have tried almost everything.

asked 9 months ago · 319 views
4 Answers

Hello.
Is the path used for ALB health checks correct?
ALB performs health checks on "/" by default.
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html

answered 9 months ago
  • Yes, like this:

    resource "aws_lb_target_group" "map" {
      for_each = var.target_groups

      name        = "backend-${each.key}"
      vpc_id      = data.aws_vpc.bid365-backend.id
      port        = 3000
      protocol    = "HTTP"
      target_type = "ip" # Specify the target type as "ip" for Fargate

      health_check {
        enabled             = true
        interval            = 60
        port                = "traffic-port"
        path                = "/ping"
        protocol            = "HTTP"
        timeout             = 5
        healthy_threshold   = 2
        unhealthy_threshold = 3
        matcher             = "200"
      }
    }
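    For completeness, a forwarding rule roughly like this attaches the target group to the 443 listener (a sketch with placeholder values, not my exact rule):

    resource "aws_lb_listener_rule" "api" {
      listener_arn = aws_lb_listener.app.arn
      priority     = 100 # placeholder

      action {
        type             = "forward"
        target_group_arn = aws_lb_target_group.map["api"].arn # placeholder key
      }

      condition {
        host_header {
          values = ["api.example.com"] # placeholder host name
        }
      }
    }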


I have been in this situation and I feel your pain. Here are a few things that I did:

---Check 1

Check the inbound rules on the task/service security group, and the outbound rules on the ALB security group.
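
For example, the ALB's security group needs an outbound rule that can reach the tasks on the container port, something roughly like this (a sketch; the resource names are assumptions based on the snippets you posted):

# Let the ALB reach the Fargate tasks on the container port.
# "module.sg" and "aws_security_group.fargate" are assumptions taken from the
# snippets above; adjust to your actual resource names.
resource "aws_security_group_rule" "alb_to_tasks" {
  type                     = "egress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = module.sg.id                  # ALB security group
  source_security_group_id = aws_security_group.fargate.id # task security group
}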

---Check 2

I would try getting a shell on the Fargate task. Some helpful instructions on how to do this here.
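
On Fargate there is no SSH as such, but ECS Exec gives you a shell inside the task. It has to be enabled on the service, and the task role needs the SSM ssmmessages permissions; roughly (assuming your service resource is called aws_ecs_service.api):

resource "aws_ecs_service" "api" {
  # ... the existing cluster/task_definition/network/load_balancer arguments stay as-is ...
  enable_execute_command = true # required before `aws ecs execute-command` will work
}

# Then open a shell in the running task:
#   aws ecs execute-command --cluster <cluster-name> --task <task-id> \
#     --container api --interactive --command "/bin/sh"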

Then increase the health check interval and retry count on the target group.

Then, once you have a shell in the task, check whether the health check endpoint is actually working: "http://localhost:3000/ping"

---Check 3

If you can't do Check 2, maybe try logging the ping output in the container for some more clues.

answered 9 months ago
  • The security group for Fargate is created like this:

    resource "aws_security_group" "fargate" {
      name_prefix = "fargate-security-group-"
      vpc_id      = "vpc-0370dd3da02a2770f"

      ingress {
        from_port   = 80
        to_port     = 80
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }

      ingress {
        from_port       = 3000
        to_port         = 3000
        protocol        = "tcp"
        security_groups = ["sg-0d73dc6bd50a4d4a1"] # If you don't know the ELB's security group ID, use its CIDR range (e.g., 10.0.0.0/8)
      }

      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }

    As for the SSH part, I was even able to access public_ip_of_task:3000/ping and it returned success.


How is the security group on the ECS service configured?

answered 9 months ago
  • The security group is the same aws_security_group "fargate" resource shown in my reply above.


Hi,

Can you try without this "healthCheck" part in your task definition?

"healthCheck": {
    "command": [
        "CMD-SHELL",
        "curl -f http://localhost:3000/ping || exit 1"
    ],
    ...
}

If your task then works, it means that this was the problem. In that case, you probably have to allow the localhost connection in your security group. Add this:

ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["127.0.0.1/32"]
}

By the way, you don't need the container health check at all, because the target group already checks the same path; it's redundant.

Donov
answered 9 months ago
