NLB throughput bottleneck

Hello!

I am experiencing a bottleneck on the NLB side.

Setup:

  1. An ECS service that receives SMTP traffic.
  2. Lambdas that send emails to that ECS service. One email equals one connection.

Case: When the Lambdas send emails through the NLB, they can send around 120K emails per minute. However, when the same number of Lambdas send directly to the ECS tasks, with the same ECS service capacity, they can send 180K emails per minute.

Observations:

  1. Increasing the ECS service capacity does not affect the number of emails through NLB.
  2. Changing the email size also does not affect the number of emails.
  3. Adding a third availability zone increased NLB throughput to up to ~180K.
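
To make the observations above easier to compare, here is a rough normalization of the reported figures, assuming one new connection per email as described in the setup. The ~180K figure is approximate, so the result is only indicative. The function name is made up for illustration.

```python
# Figures taken from the observations above (emails per minute through the NLB).
observed = {2: 120_000, 3: 180_000}  # AZ count -> emails per minute


def per_az_per_second(emails_per_minute: int, az_count: int) -> float:
    """New connections per second per AZ, assuming one connection per email."""
    return emails_per_minute / az_count / 60


for az_count, rate in observed.items():
    print(f"{az_count} AZs: {per_az_per_second(rate, az_count):.0f} connections/s per AZ")
```

Both cases come out at roughly 1000 new connections per second per AZ, which is consistent with a limit that scales per AZ (or per subnet) rather than with the number of ECS tasks.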

Questions:

  1. Are there any connection limits for NLB?
  2. Do you have any clues as to what may be causing such a bottleneck?
1 Answer

Without more information, such as the maximum number of Lambda functions and the maximum number of service instances involved, it is difficult to tell exactly where you could increase throughput.

I suspect your architecture has some limit other than networking that caps your throughput at 180K emails per minute. Are your Lambda functions configured to communicate inside your VPC? You might be limited by available IP space in the subnets. You could also be limited by the number of concurrent Lambda functions. How does your ECS task scale? Connections? CPU? Memory? Something else? Are you seeing rejected connections from the Lambda functions that are being retried? Are the tasks themselves being limited by the calls out to send the emails, by multiple retries, etc.? You should instrument your application to get more information.
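
As one possible way to instrument the sending side, here is a minimal sketch that counts connection attempts, failures, and retries and records connect latency. The names `ConnStats` and `connect_with_stats` are made up for illustration, and the retry policy (three attempts, no backoff) is an assumption, not something taken from your setup.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConnStats:
    """Counters for one sender (for example, one Lambda invocation)."""
    attempts: int = 0
    failures: int = 0
    retries: int = 0
    latencies: List[float] = field(default_factory=list)  # seconds, successes only


def connect_with_stats(connect: Callable[[], object], stats: ConnStats,
                       max_attempts: int = 3):
    """Call connect() up to max_attempts times, recording each outcome."""
    for attempt in range(1, max_attempts + 1):
        stats.attempts += 1
        if attempt > 1:
            stats.retries += 1
        start = time.perf_counter()
        try:
            conn = connect()
        except OSError:
            # Refused connections, timeouts, and smtplib errors are all OSError subclasses.
            stats.failures += 1
            continue
        stats.latencies.append(time.perf_counter() - start)
        return conn
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

In a sender this could wrap the real connection call, for example `connect_with_stats(lambda: smtplib.SMTP(host, 25, timeout=5), stats)`, where `host` is a placeholder for the NLB DNS name or a task address. Comparing the counters for the NLB path and the direct path should show whether connections are being rejected or are just slow to establish.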

Here are the quotas for NLB: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-limits.html. The adjustable quotas may be having an impact, but I don't think they are the first place to look. It's possible that the Targets per Availability Zone per Network Load Balancer quota is having some effect, since adding an AZ increased the throughput (and likely the size of the task group in ECS).

AWS
answered 8 months ago
  • Thanks for the attention. In one of the test scenarios there were 72 ECS task instances, tested with 2 AZs and with 3 AZs, and the same when testing without the load balancer by sending directly to the ECS tasks. There were 360 Lambdas (the limit is 1000 concurrent Lambdas), and they are in the same VPC as the NLB and the ECS cluster. When sending directly to the tasks, they are able to handle more. Also, the limit of ~120K on 2 AZs and ~180K on 3 AZs does not change when increasing ECS capacity or Lambda concurrency. There is no autoscaling in the test scenarios. So my understanding is that something is limiting this on the NLB side...

  • I recommend you look at the potential bottlenecks I called out above first before settling on the NLB as the source.

    That being said, have you looked at the NLB metrics, specifically ActiveFlowCount, PeakPacketsPerSecond, TCP_Client_Reset_Count, TCP_ELB_Reset_Count, and TCP_Target_Reset_Count? You will likely want to use the Availability Zone dimension to get further insight.

  • Thanks.
    It appears there was an issue with the test setup. The Lambda function had a VPC (Virtual Private Cloud) configuration, and removing the VPC setup from the Lambda function resolved the issue.

  • Good to know!
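
For anyone who wants to check the NLB metrics named in the comments above per Availability Zone, here is a rough sketch using boto3's CloudWatch `get_metric_statistics`. The load balancer identifier and zone names are placeholders, the helper names are made up for illustration, and the statistic chosen for each metric is an assumption about what is most useful to look at.

```python
from datetime import datetime, timedelta, timezone

# Metric name -> statistic to request (an assumed, reasonable choice per metric).
NLB_METRICS = {
    "ActiveFlowCount": "Average",
    "PeakPacketsPerSecond": "Maximum",
    "TCP_Client_Reset_Count": "Sum",
    "TCP_ELB_Reset_Count": "Sum",
    "TCP_Target_Reset_Count": "Sum",
}


def build_requests(load_balancer, zones, minutes=60, period=60, now=None):
    """Build one get_metric_statistics request per (zone, metric) pair."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return [
        {
            "Namespace": "AWS/NetworkELB",
            "MetricName": metric,
            "Dimensions": [
                {"Name": "LoadBalancer", "Value": load_balancer},
                {"Name": "AvailabilityZone", "Value": zone},
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": period,
            "Statistics": [stat],
        }
        for zone in zones
        for metric, stat in NLB_METRICS.items()
    ]


def fetch(requests):
    """Run the requests; assumes boto3 is installed and credentials are configured."""
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    for req in requests:
        resp = cloudwatch.get_metric_statistics(**req)
        yield req["Dimensions"][1]["Value"], req["MetricName"], resp["Datapoints"]
```

A call such as `fetch(build_requests("net/my-nlb/0123456789abcdef", ["us-east-1a", "us-east-1b"]))` would then yield the datapoints per zone and metric, which makes it easier to see whether resets or flow counts differ between AZs during a test run.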
