AWS ECS Fargate Task Throttling Outbound Traffic

I have a .NET API application deployed as an ECS Fargate task. This task makes internal calls to an on-premises server. The ECS task is hosted in a VPC with a private subnet and connects to the on-premises server via a transit gateway.

To simplify the scenario, we tested by calling the task's private IP directly. With only a few calls, the task responded correctly. The endpoint is a simple POST API call with little data exchanged in either direction.

However, when we increased the load to 200 virtual users and ramped up the test (i.e., a performance test), the internal calls from the task to the on-premises server started timing out after about a minute. (The call usually takes only 3 seconds, and the timeout is set to 1 minute.)

Some key observations:

  • We saw clear timeout errors in the application logs. While some requests were processed successfully, others failed due to timeouts.

  • For the requests that failed with timeout errors, we confirmed that the on-premises server never received those requests, so it's not an issue with the on-premises server's performance.

  • The ECS task's CPU and memory usage remained within normal limits. Even after allocating additional CPU and memory, the issue persisted.

  • There are no firewall rules or policies on either the AWS or on-premises side that block the traffic.

  • When we scaled the ECS service out to a few more tasks and pointed the tests at the load balancer instead of a single container, the issue persisted.

  • We use Fargate with the default awsvpc network mode.
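
One way to narrow down the observation that failed requests never reach the on-premises server is to look at the TCP socket states inside the task while the test runs. The following is a minimal sketch, assuming a Linux container, a shell in the task (for example via ECS Exec), and a Python interpreter in the image; none of these are confirmed by the post. The function names are hypothetical. It counts sockets per state from the text of /proc/net/tcp.

```python
from collections import Counter

# Linux kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp"):
    """Read the given proc file and return the per-state socket counts."""
    with open(path) as handle:
        return count_tcp_states(handle.read())
```

Calling `read_tcp_states()` every few seconds during the load test, and again with `path="/proc/net/tcp6"` in case the application uses dual-stack sockets, shows how the counts evolve. A growing number of SYN_SENT sockets would suggest that connection attempts leave the task but are never answered, while a large TIME_WAIT count would point toward connections being opened and closed per request.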

Since this is a test with only 200 virtual users, I do not think bandwidth is the root cause.

Some articles mention that Fargate tasks have an outbound TCP connection limit, but the AWS documentation does not mention one.
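
To put the connection-limit concern in perspective, here is a rough back-of-the-envelope sketch of the socket pressure this test could create on one task. It rests on several assumptions that the post does not confirm: each virtual user sends requests back to back with no think time, the application opens a new TCP connection per outbound call (the worst case), closed sockets linger in TIME_WAIT for 60 seconds (the Linux default), and the container uses the default Linux ephemeral port range 32768 to 60999. The function name and parameters are hypothetical.

```python
def estimate_socket_pressure(virtual_users, call_seconds,
                             time_wait_seconds=60.0,
                             port_low=32768, port_high=60999):
    """Estimate outbound socket usage toward a single destination.

    Assumes a closed-loop test (each user waits for a response before
    sending the next request) and one new connection per request.
    """
    # Closed loop: each user completes one call every call_seconds.
    requests_per_second = virtual_users / call_seconds
    # At most one outbound call is in flight per virtual user.
    in_flight = virtual_users
    # Little's law: sockets lingering = arrival rate * time spent lingering.
    lingering = requests_per_second * time_wait_seconds
    available_ports = port_high - port_low + 1
    return {
        "requests_per_second": requests_per_second,
        "in_flight_connections": in_flight,
        "time_wait_sockets": lingering,
        "available_ports": available_ports,
        "port_usage_fraction": (in_flight + lingering) / available_ports,
    }


estimate = estimate_socket_pressure(virtual_users=200, call_seconds=3.0)
```

With the numbers from the question (200 users, 3-second calls) this gives roughly 67 requests per second and about 4,000 sockets in TIME_WAIT against 28,232 ephemeral ports, which is well short of exhausting the range. Under these assumptions, plain ephemeral port exhaustion alone would not explain the timeouts, although a different connection pattern in the application could change the picture.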

I am not sure whether this is an application configuration issue, an infrastructure issue, or a networking issue.

This is the egress configuration for the security group, which looks fine to me:

egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]
}

Is there an application coding issue, or does anything in the task definition look wrong? Thanks.

This is the task definition in JSON (with real account IDs, ARNs, etc. removed):

{
  "taskDefinitionArn": "arn",
  "containerDefinitions": [
      {
          "name": "name",
          "image": "...",
          "cpu": 8192,
          "memory": 32768,
          "portMappings": [
              {
                  "containerPort": 5004,
                  "hostPort": 5004,
                  "protocol": "tcp"
              }
          ],
          "essential": true,
          "environment": [
              ...
          ],
          "mountPoints": [],
          "volumesFrom": [],
          "secrets": [
             ...
          ],
          "logConfiguration": {
              "logDriver": "awslogs",
              "options": {
                  "awslogs-group": "/ecs/mygroup",
                  "awslogs-region": "us-west-2",
                  "awslogs-stream-prefix": "ecs"
              }
          },
          "systemControls": []
      }
  ],
  "family": "myfamily",
  "taskRoleArn": "arn",
  "executionRoleArn": "roleArn",
  "networkMode": "awsvpc",
  "revision": 231,
  "volumes": [],
  "status": "ACTIVE",
  "requiresAttributes": [
      {
          "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
      },
      {
          "name": "ecs.capability.execution-role-awslogs"
      },
      {
          "name": "com.amazonaws.ecs.capability.ecr-auth"
      },
      {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
      },
      {
          "name": "ecs.capability.secrets.asm.environment-variables"
      },
      {
          "name": "ecs.capability.increased-task-cpu-limit"
      },
      {
          "name": "com.amazonaws.ecs.capability.task-iam-role"
      },
      {
          "name": "ecs.capability.execution-role-ecr-pull"
      },
      {
          "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
      },
      {
          "name": "ecs.capability.task-eni"
      }
  ],
  "placementConstraints": [],
  "compatibilities": [
      "EC2",
      "FARGATE"
  ],
  "requiresCompatibilities": [
      "FARGATE"
  ],
  "cpu": "8192",
  "memory": "32768",
  "registeredAt": "2024-08-22T18:30:57.544Z",
  "deregisteredAt": "2024-08-22T20:23:45.607Z",
  "registeredBy": "...",
  "tags": [
    ....
  ]
}
1 Answer
  1. Investigate TCP Connection Limits: Check if there are any connection limits either at the Fargate task level, NAT Gateway, or Transit Gateway.
  2. Configure .NET HttpClient for High Concurrency: Ensure that HttpClient and the .NET thread pool are configured correctly.
  3. Enable VPC Flow Logs: Analyze the network traffic to identify any anomalies or issues.
  4. Check Task Role Permissions and Scaling: Ensure that the ECS task role has correct permissions and that task scaling is functioning as expected.
  5. Monitor with CloudWatch: Use CloudWatch to monitor logs and metrics for any additional insights.
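
As a starting point for step 3, the sketch below summarizes VPC Flow Log records for traffic to one destination. It assumes the flow log uses the default (version 2) record format and that the records have been exported as plain text lines; the function name is hypothetical.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def summarize_flow_logs(lines, dstaddr):
    """Count ACCEPT and REJECT records for traffic sent to dstaddr."""
    summary = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # not a default-format record
        record = dict(zip(FIELDS, parts))
        # Skip NODATA / SKIPDATA records, header rows, and other destinations.
        if record["log_status"] != "OK" or record["dstaddr"] != dstaddr:
            continue
        summary[record["action"]] += 1
    return summary
```

Passing the log lines and the private IP of the on-premises server returns a counter keyed by `ACCEPT` and `REJECT`. Note that `ACCEPT` only indicates that security groups and network ACLs allowed the flow at that network interface; it does not confirm that the packets reached the on-premises server.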
Deeksha (AWS, EXPERT), answered 15 days ago
