I have a .NET API application deployed as an ECS Fargate task. The task makes internal calls to an on-premises server; it runs in a private subnet of a VPC and reaches the on-premises server through a transit gateway.
To simplify the scenario, we conducted tests by directly accessing the private IP of the task. When the test involved only a few calls, the task responded correctly. The endpoint itself is a simple POST API call with very little data going back and forth.
However, when we increased the number of virtual users to 200 and ramped up the test (i.e., a performance test), we observed that after about a minute the internal calls from the task to the on-premises server started timing out. (The call usually takes about 3 seconds, and the client timeout is set to 1 minute.)
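For reference, the outbound call is essentially this shape (a minimal sketch, not the exact code; OnPremClient, OrderPayload, and the URL are placeholders):

using System.Net.Http.Json;

// Placeholder payload type for the sketch.
public record OrderPayload(string Id, string Data);

// Typed client: HttpClient is injected, not created per request.
public class OnPremClient
{
    private readonly HttpClient _http;

    public OnPremClient(HttpClient http) => _http = http;

    public async Task<HttpResponseMessage> PostAsync(OrderPayload payload, CancellationToken ct)
    {
        // Simple POST with a small JSON body; the client timeout is 1 minute.
        return await _http.PostAsJsonAsync("http://onprem.example.internal/api/submit", payload, ct);
    }
}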
Some key observations:
- We saw clear timeout errors in the application logs. Some requests were processed successfully, while others failed with timeouts.
- For the requests that failed with timeouts, we confirmed that the on-premises server never received them, so it is not a performance issue on the on-premises side.
- The ECS task's CPU and memory usage remained within normal limits. Even after allocating additional CPU and memory, the issue persisted.
- There are no firewall rules or policies on either the AWS or on-premises side that block the traffic.
- Scaling the ECS service out to a few more tasks and pointing the tests at the load balancer instead of a single container made no difference; the issue continued.
- We use Fargate with the default awsvpc network mode.
Since it is just a test with only 200 virtual users, I don't think bandwidth is the root cause.
Some articles mention that a Fargate task has an outbound TCP connection limit, but the AWS documentation doesn't say anything about one.
I'm not sure whether this is an app configuration issue, an infrastructure issue, or a networking issue.
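In case it helps rule out the app side, this is roughly how the client is wired up (a simplified sketch of Program.cs, assuming IHttpClientFactory; the SocketsHttpHandler values are illustrative, not verified against our actual config):

var builder = WebApplication.CreateBuilder(args);

// Typed client registration: all requests share a pooled SocketsHttpHandler,
// so we should not be creating a new HttpClient (and socket) per request.
builder.Services.AddHttpClient<OnPremClient>(client =>
{
    client.Timeout = TimeSpan.FromMinutes(1); // matches the 1-minute timeout above
})
.ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
{
    // Recycle pooled connections so stale routes/DNS don't pin dead sockets.
    PooledConnectionLifetime = TimeSpan.FromMinutes(2),
    // Default is int.MaxValue; with 200 VUs an unbounded pool can open a
    // large number of concurrent TCP connections toward on-prem.
    MaxConnectionsPerServer = 100
});

var app = builder.Build();
app.Run();

If we were creating a new HttpClient per request, or letting the pool grow unbounded, I'd expect ephemeral-port/TIME_WAIT exhaustion on the task ENI, which would match the symptom of requests never reaching the on-premises server.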
This is the egress config for the security group, which looks correct to me:
egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]
}
Is there an issue in the app code, or does the task definition look wrong? Thanks.
This is the task definition in JSON (with the real account ID, ARNs, etc. removed):
{
  "taskDefinitionArn": "arn",
  "containerDefinitions": [
    {
      "name": "name",
      "image": "...",
      "cpu": 8192,
      "memory": 32768,
      "portMappings": [
        {
          "containerPort": 5004,
          "hostPort": 5004,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        ...
      ],
      "mountPoints": [],
      "volumesFrom": [],
      "secrets": [
        ...
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/mygroup",
          "awslogs-region": "us-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "systemControls": []
    }
  ],
  "family": "myfamily",
  "taskRoleArn": "arn",
  "executionRoleArn": "roleArn",
  "networkMode": "awsvpc",
  "revision": 231,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
    { "name": "com.amazonaws.ecs.capability.logging-driver.awslogs" },
    { "name": "ecs.capability.execution-role-awslogs" },
    { "name": "com.amazonaws.ecs.capability.ecr-auth" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19" },
    { "name": "ecs.capability.secrets.asm.environment-variables" },
    { "name": "ecs.capability.increased-task-cpu-limit" },
    { "name": "com.amazonaws.ecs.capability.task-iam-role" },
    { "name": "ecs.capability.execution-role-ecr-pull" },
    { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18" },
    { "name": "ecs.capability.task-eni" }
  ],
  "placementConstraints": [],
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "8192",
  "memory": "32768",
  "registeredAt": "2024-08-22T18:30:57.544Z",
  "deregisteredAt": "2024-08-22T20:23:45.607Z",
  "registeredBy": "...",
  "tags": [
    ...
  ]
}