My ECS service is failing after the task starts

I created the ECS service task definition like this:

{
    "containerDefinitions": [
        {
            "name": "api",
            "image": "414397229292.dkr.ecr.ap-south-1.amazonaws.com/bids365/backend:79cb8aa",
            "cpu": 256,
            "memoryReservation": 512,
            "portMappings": [
                {
                    "containerPort": 3000,
                    "hostPort": 3000,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "IMPART_ENV",
                    "value": "dev"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "disableNetworking": false,
            "privileged": false,
            "readonlyRootFilesystem": true,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "bids365/backend/dev",
                    "awslogs-region": "ap-south-1",
                    "awslogs-stream-prefix": "dev"
                }
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "curl -f http://localhost:3000/ping || exit 1"
                ],
                "interval": 60,
                "timeout": 2,
                "retries": 3,
                "startPeriod": 60
            }
        }
    ],
    "family": "service-dev",
    "taskRoleArn": "arn:aws:iam::414397229292:role/task-role-backend-dev",
    "executionRoleArn": "arn:aws:iam::414397229292:role/ecs-task-execution-role",
    "networkMode": "awsvpc",
    "volumes": [],
    "placementConstraints": [],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "256",
    "memory": "512"
}

My ALB is created like this using Terraform:

resource "aws_lb" "backend" {
  name               = "backend-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [module.sg.id]
  subnets            = ["subnet-082af865c1410b0ef", "subnet-0da703055394aa446", "subnet-02d9bc7c78c939446"]

  enable_deletion_protection = false //todo - change this when we get more clarity

  tags = {
    Environment = "production"
  }

  lifecycle {
    prevent_destroy = false
  }
}

module "sg" {
  source    = "cloudposse/security-group/aws"
  version   = "0.1.3"
  vpc_id    = data.aws_vpc.bid365-backend.id
  delimiter = ""
  name      = "443-ingress-private-egress"
  rules = [
    {
      type      = "egress"
      from_port = 0
      to_port   = 65535
      protocol  = "TCP"
      # cidr_blocks = [data.aws_vpc.bid365-backend.cidr_block]
      self      = true
    },
    {
      type      = "ingress"
      from_port = 443
      to_port   = 443
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self      = true
    },
    {
      type      = "ingress"
      from_port = 80
      to_port   = 80
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self      = true
    },
    {
      type      = "ingress"
      from_port = 3000
      to_port   = 3000
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self      = true
    }
  ]
}

resource "aws_lb_listener" "redirect_non_ssl" {
  load_balancer_arn = aws_lb.backend.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

resource "aws_acm_certificate" "cert" {
  domain_name       = var.app_dns_entry
  validation_method = "DNS"

  lifecycle {
    prevent_destroy = false
  }
}

resource "aws_lb_listener" "app" {
  load_balancer_arn = aws_lb.backend.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = "arn:aws:acm:ap-south-1:414397229292:certificate/4290c5e1-4b49-40bf-afb5-bedeefd072c2"

  default_action {
    type = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "Not Found\n"
      status_code  = "404"
    }
  }

  lifecycle {
    prevent_destroy = false
  }
}

Everything seems correct, and the task and ALB are all created, but the task is failing the target group health check. The container port is mapped correctly, and the /ping endpoint I created for the health check also works correctly if I access it from the container. Please help me, I have been stuck on this for a long time and have tried almost everything.

Asked 9 months ago, 330 views
4 answers
Hello.
Is the path used for ALB health checks correct?
ALB performs health checks on "/" by default.
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html

Expert
Answered 9 months ago
  • Yes, like this:

    resource "aws_lb_target_group" "map" { for_each = var.target_groups name = "backend-${each.key}" vpc_id = data.aws_vpc.bid365-backend.id port = 3000 protocol = "HTTP" target_type = "ip" # Specify the target type as "ip" for Fargate health_check { enabled = true interval = 60 port = "traffic-port" path = "/ping" protocol = "HTTP" timeout = 5 healthy_threshold = 2 unhealthy_threshold = 3 matcher = "200" } }

I have been in this situation and I feel your pain. Here are a few things that I did:

---Check 1

Check the security group inbound rules on the task/service, and the outbound rules on the ALB.
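
For example, here is a rough Terraform sketch of what this check is after (the resource names are illustrative placeholders, not taken from your config): the task's security group needs an inbound rule from the ALB's security group on the container port, and the ALB's security group needs an outbound rule towards the tasks on that port, because the health checks originate from the ALB.

resource "aws_security_group_rule" "task_ingress_from_alb" {
  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.fargate.id # task SG (placeholder name)
  source_security_group_id = aws_security_group.alb.id     # ALB SG (placeholder name)
}

resource "aws_security_group_rule" "alb_egress_to_task" {
  type                     = "egress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.alb.id     # ALB SG (placeholder name)
  source_security_group_id = aws_security_group.fargate.id # for egress rules this is the destination SG
}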

---Check 2

I would try SSHing onto the Fargate task. There are some helpful instructions on how to do this here.

Then increase the health check interval and count on the target group.

Then, once you SSH in, figure out whether the health check is actually working: "http://localhost:3000/ping".
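
For reference, a minimal sketch of how to get a shell in a Fargate task with ECS Exec (there is no classic SSH on Fargate); the service, cluster, and variable names below are placeholders rather than values from this thread, and the task role also needs the ssmmessages permissions that ECS Exec requires:

resource "aws_ecs_service" "api" {
  name                   = "backend-dev"                   # placeholder
  cluster                = aws_ecs_cluster.dev.id          # placeholder
  task_definition        = aws_ecs_task_definition.api.arn # placeholder
  desired_count          = 1
  launch_type            = "FARGATE"
  enable_execute_command = true # allows "aws ecs execute-command" against running tasks

  network_configuration {
    subnets         = var.private_subnet_ids           # placeholder
    security_groups = [aws_security_group.fargate.id]  # placeholder
  }
}

# Then, with the Session Manager plugin installed locally:
#   aws ecs execute-command --cluster <cluster> --task <task-id> \
#     --container api --interactive --command "/bin/sh"
# and inside the container: curl -v http://localhost:3000/ping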

---Check 3

If you can't do Check 2, maybe try logging the ping output in the container for some more clues.

Answered 9 months ago
  • The security group for Fargate is created like this:

    resource "aws_security_group" "fargate" {
      name_prefix = "fargate-security-group-"
      vpc_id      = "vpc-0370dd3da02a2770f"

      ingress {
        from_port   = 80
        to_port     = 80
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }

      ingress {
        from_port       = 3000
        to_port         = 3000
        protocol        = "tcp"
        security_groups = ["sg-0d73dc6bd50a4d4a1"] # If you don't know the ELB's security group ID, use its CIDR range (e.g., 10.0.0.0/8)
      }

      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }

    For the SSH part, I was even able to access public_ip_of_task:3000/ping and it returned success.

How is the security group on the ECS service configured?

Expert
Answered 9 months ago
  • The security group is created like this:

    resource "aws_security_group" "fargate" {
      name_prefix = "fargate-security-group-"
      vpc_id      = "vpc-0370dd3da02a2770f"

      ingress {
        from_port   = 80
        to_port     = 80
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }

      ingress {
        from_port       = 3000
        to_port         = 3000
        protocol        = "tcp"
        security_groups = ["sg-0d73dc6bd50a4d4a1"] # If you don't know the ELB's security group ID, use its CIDR range (e.g., 10.0.0.0/8)
      }

      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }

Hi,

Can you try without the "healthCheck" part in your task definition?

"healthCheck": {
    "command": [
        "CMD-SHELL",
        "curl -f http://localhost:3000/ping || exit 1"
    ],
    ...
}

If your task works after that, it means the container health check is the problem. So you probably have to allow localhost connections in your security group. Add this:

ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["127.0.0.1/32"]
}

By the way, you don't need the CMD-SHELL health check, because the target group already checks this path; it's redundant.

Donov
Answered 9 months ago
