ECS Fargate - communication between 2 containers in the same task

0

Hello, I am trying to run a Flink job manager and a Flink task manager as two separate containers in the same task.

According to:

  1. https://aws.amazon.com/blogs/compute/task-networking-in-aws-fargate/
  2. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-task-networking.html

the task manager should be able to connect to the job manager on host 127.0.0.1 or localhost. However, I see these errors in the task manager container:

2024-04-03 18:36:00,630 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@127.0.0.1:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@127.0.0.1:6123/user/rpc/resourcemanager_*.
2024-04-03 18:20:46,995 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@localhost:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@localhost:6123/user/rpc/resourcemanager_*.

**My latest guess is that the problem is RPC traffic. Updates are in the comments section.** I would appreciate any help.

My task definition is below:

{
    "family": "flink-domain-count",
    "containerDefinitions": [
        {
            "name": "jobmanager",
            "image": "623707875154.dkr.ecr.eu-central-1.amazonaws.com/flink-domain-count@sha256:5d0aa235a2f82c9808cd0a9557929511f059e00f211003eaa2e11825e06eddd1",
            "cpu": 0,
            "portMappings": [
                {
                    "name": "flink-domain-count-8081-tcp",
                    "containerPort": 8081,
                    "hostPort": 8081,
                    "protocol": "tcp",
                    "appProtocol": "http"
                },
                {
                    "name": "flink-domain-count-6123-tcp",
                    "containerPort": 6123,
                    "hostPort": 6123,
                    "protocol": "tcp",
                    "appProtocol": "http"
                }
            ],
            "essential": true,
            "command": [
                "/bin/bash",
                "-c",
                "/opt/flink/bin/jobmanager.sh start && sleep 10 && /opt/flink/bin/flink run --python /app/groupby2.py"
            ],
            "environment": [
                {
                    "name": "FLINK_PROPERTIES",
                    "value": "jobmanager.rpc.address: jobmanager"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/flink-domain-count-jobmanager",
                    "awslogs-region": "eu-central-1",
                    "awslogs-stream-prefix": "ecs"
                },
                "secretOptions": []
            },
            "systemControls": []
        },
        {
            "name": "taskmanager",
            "image": "623707875154.dkr.ecr.eu-central-1.amazonaws.com/flink-domain-count@sha256:5d0aa235a2f82c9808cd0a9557929511f059e00f211003eaa2e11825e06eddd1",
            "cpu": 0,
            "portMappings": [],
            "essential": true,
            "command": [
                "taskmanager"
            ],
            "environment": [
                {
                    "name": "FLINK_PROPERTIES",
                    "value": "jobmanager.rpc.address: 127.0.0.1"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "dependsOn": [
                {
                    "containerName": "jobmanager",
                    "condition": "START"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/flink-domain-count-taskmanager",
                    "awslogs-region": "eu-central-1",
                    "awslogs-stream-prefix": "ecs"
                },
                "secretOptions": []
            },
            "systemControls": []
        }
    ],
    "taskRoleArn": "arn:aws:iam::623707875154:role/service-role/confluent-apps-role-staging",
    "executionRoleArn": "arn:aws:iam::623707875154:role/service-role/confluent-apps-role-staging",
    "networkMode": "awsvpc",
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "512",
    "memory": "1024",
    "runtimePlatform": {
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX"
    }
}
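
One detail in the task definition above may be worth checking: the jobmanager container sets `jobmanager.rpc.address: jobmanager`, while the taskmanager container sets `jobmanager.rpc.address: 127.0.0.1`. Whether this mismatch is related to the connection errors is an assumption and is not confirmed anywhere in this thread. The Python sketch below only extracts the value from each container's FLINK_PROPERTIES so the two can be compared. The function name `rpc_addresses` and the trimmed `task_def` dictionary are illustrative.

```python
def rpc_addresses(task_def: dict) -> dict:
    """Map container name -> jobmanager.rpc.address found in its FLINK_PROPERTIES."""
    found = {}
    for container in task_def.get("containerDefinitions", []):
        for env in container.get("environment", []):
            if env.get("name") != "FLINK_PROPERTIES":
                continue
            # FLINK_PROPERTIES holds "key: value" lines, one per property.
            for line in env.get("value", "").splitlines():
                key, sep, value = line.partition(":")
                if sep and key.strip() == "jobmanager.rpc.address":
                    found[container["name"]] = value.strip()
    return found


if __name__ == "__main__":
    # Trimmed copy of the task definition from the question (illustrative).
    task_def = {
        "containerDefinitions": [
            {
                "name": "jobmanager",
                "environment": [
                    {"name": "FLINK_PROPERTIES",
                     "value": "jobmanager.rpc.address: jobmanager"}
                ],
            },
            {
                "name": "taskmanager",
                "environment": [
                    {"name": "FLINK_PROPERTIES",
                     "value": "jobmanager.rpc.address: 127.0.0.1"}
                ],
            },
        ]
    }
    addresses = rpc_addresses(task_def)
    print(addresses)
    if len(set(addresses.values())) > 1:
        print("Containers disagree on jobmanager.rpc.address")
```

If the values differ, one experiment would be to set the same address in both containers and see whether the task manager log changes.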
asked a month ago, 154 views
3 Answers
0

Indeed, containers that belong to the same task can also communicate over the localhost interface. Reference.

Judging by the logs, the containers communicate over plain TCP (akka.tcp), while the port mappings in the task definition set appProtocol to http.

From the docs:

appProtocol: If you don't set a value for this parameter, then TCP is used

I'd try removing appProtocol to see whether that works. If it doesn't, I'd ask the Flink team how they suggest getting this communication working over localhost with a sidecar container.
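
A minimal sketch of that change, assuming the task definition has been saved as JSON and loaded into a Python dictionary. The helper name `strip_app_protocol` is hypothetical; it simply drops the `appProtocol` key from every port mapping so that the default described in the docs quote above applies. Whether this resolves the Flink error is untested here.

```python
import copy
import json


def strip_app_protocol(task_def: dict) -> dict:
    """Return a copy of the task definition without appProtocol in any port mapping."""
    cleaned = copy.deepcopy(task_def)  # leave the caller's dictionary untouched
    for container in cleaned.get("containerDefinitions", []):
        for mapping in container.get("portMappings", []):
            mapping.pop("appProtocol", None)  # no error if the key is absent
    return cleaned


if __name__ == "__main__":
    # Trimmed copy of the jobmanager port mapping from the question (illustrative).
    example = {
        "containerDefinitions": [
            {
                "name": "jobmanager",
                "portMappings": [
                    {
                        "name": "flink-domain-count-6123-tcp",
                        "containerPort": 6123,
                        "hostPort": 6123,
                        "protocol": "tcp",
                        "appProtocol": "http",
                    }
                ],
            }
        ]
    }
    print(json.dumps(strip_app_protocol(example), indent=4))
```

The cleaned dictionary can then be written back to a file and registered as a new task definition revision.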

Hope this helps!

AWS
answered a month ago
0

What we managed to test:

  1. Created another task with a similar deployment: one container running a FastAPI app and one container that constantly calls the app. It works perfectly without any port mapping. I set up Session Manager on the containers, connected via SSH, and tested curl, nc and telnet against both localhost and the 127.0.0.1 IP.
  2. Did the same on the Flink containers, and traffic gets through there as well. However, my colleague suggests the problem could be RPC traffic, as we found this old issue: https://github.com/grpc/grpc/issues/19633

At the moment we are looking for a way to test it. Any ideas?
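
One possible way to repeat the nc/telnet check in a scriptable form is a plain TCP connect test, sketched below in Python. The helper name `can_connect` is hypothetical, and port 6123 is taken from the task definition in the question. A successful connect only shows that something is listening on that port; it does not prove that the Flink RPC layer accepts the connection.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Run inside the taskmanager container; 6123 is the JobManager RPC port.
    for host in ("127.0.0.1", "localhost"):
        print(host, 6123, can_connect(host, 6123))
```

If both hosts report True while the task manager still logs the resolution error, the problem is more likely above the TCP layer than in the task networking.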

answered a month ago
-1

Each container running on Fargate will receive its own separate runtime environment. That means that the container cannot connect to another container using the loopback (127.0.0.1) address. You must use the actual IP address for the other container - the one that is allocated to the container within the VPC. You'll need to look this up by using a service discovery tool or by querying the AWS APIs.

Indeed, every day is a learning day - see the answer from Henrique Santana - this is indeed possible.

AWS
EXPERT
answered a month ago
  • @Bretski-AWS, thank you for your comment. I don't understand whether these knowledge articles are incorrect or my interpretation is wrong. Also, maybe I would be able to achieve what I want with the EC2 launch type and bridge network mode?
