Network Load Balancer (NLB) troubleshooting guide

5 minute read
Content level: Advanced

This guide focuses on expediting isolation and resolution of incidents involving NLB. This guide will help you gather the right information to troubleshoot NLB issues efficiently.

Objective

AWS customers rely heavily on Elastic Load Balancing (ELB) to distribute incoming traffic across multiple targets, such as EC2 instances, containers, and IP addresses, in one or more Availability Zones (AZs). ELB supports Application Load Balancers (ALB), Network Load Balancers (NLB), Gateway Load Balancers (GWLB), and Classic Load Balancers (CLB).

When an AWS customer experiences application impairment, the first priority is to determine quickly whether NLB is contributing to the impairment or degraded performance. The steps below help make that determination.

Procedure

  1. Use the following NLB performance and observability metrics to confirm that the NLB was healthy during the reported timeframe of the incident. The Availability Zone (AZ) dimension can be used to isolate the issue to a specific NLB AZ.

    ActiveFlowCount - The total number of concurrent flows (or connections) from clients to targets. This metric includes connections in the SYN_SENT and ESTABLISHED states. TCP connections are not terminated at the load balancer, so a client opening a TCP connection to a target counts as a single flow. A zero or near-zero value suggests that a firewall or security group is restricting traffic, whereas a count in the millions may indicate a distributed denial-of-service (DDoS) attack. This metric also helps establish the application's typical workload baseline, so that anomalous traffic patterns, if any, can be identified quickly.

    PortAllocationErrorCount - The total number of ephemeral port allocation errors during a client IP translation operation. A non-zero value indicates dropped client connections. Note: Network Load Balancers support 55,000 simultaneous connections or about 55,000 connections per minute to each unique target (IP address and port) when performing client address translation. To fix port allocation errors, add more targets to the target group.

    UnHealthyHostCount - The number of targets that are considered unhealthy. This metric does not include any Application Load Balancers registered as targets. It gives the aggregate number of targets that are currently failing health checks for the load balancer.

    TCP_Client_Reset_Count - The total number of reset (RST) packets sent from a client to a target. These resets are generated by the client and forwarded by the load balancer.

    TCP_ELB_Reset_Count - The total number of reset (RST) packets generated by the load balancer. If a client or a target sends data after the idle timeout period elapses, it receives a TCP RST packet to indicate that the connection is no longer valid. Additionally, if a target becomes unhealthy, the load balancer sends a TCP RST for packets received on the client connections associated with the target, unless the unhealthy target triggers the load balancer to fail open.

    TCP_Target_Reset_Count - The total number of reset (RST) packets sent from a target to a client. These resets are generated by the target and forwarded by the load balancer.

    Note: TCP_Target_Reset_Count is an ELB metric published in CloudWatch. It monitors the total number of reset (RST) packets sent from a target (Amazon EC2 host) to a client. A reset packet is one with no payload and with the RST bit set in the TCP header flags. These resets are generated by the target and forwarded by the load balancer. Sum is the most useful statistic for this metric. Similarly, the NLB emits metrics for resets generated by the load balancer itself (TCP_ELB_Reset_Count) and resets generated by the client (TCP_Client_Reset_Count). For more details, see the CloudWatch metrics documentation for Network Load Balancers.
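The metrics listed in step 1 can be pulled for a single AZ with the CloudWatch GetMetricData API. The following is a minimal sketch, not a definitive implementation: it assumes the boto3 SDK and working AWS credentials, the helper names (`build_queries`, `fetch_metrics`) are hypothetical, the statistic chosen per metric is a judgment call (only Sum for the reset counters comes from this guide), and pagination of results is ignored for brevity.

```python
# Metric name -> statistic. Only "Sum" for the reset counters is taken from
# this guide; the remaining choices are assumptions.
NLB_METRICS = {
    "ActiveFlowCount": "Average",
    "PortAllocationErrorCount": "Sum",
    "TCP_Client_Reset_Count": "Sum",
    "TCP_ELB_Reset_Count": "Sum",
    "TCP_Target_Reset_Count": "Sum",
}


def _query(name, stat, dimensions, period):
    """Build one MetricDataQuery entry for the AWS/NetworkELB namespace."""
    return {
        "Id": name.lower(),  # query ids must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/NetworkELB",
                "MetricName": name,
                "Dimensions": dimensions,
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }


def build_queries(load_balancer, availability_zone=None, target_group=None, period=60):
    """Return the query list for one NLB, optionally narrowed to one AZ."""
    dims = [{"Name": "LoadBalancer", "Value": load_balancer}]
    if availability_zone:
        dims.append({"Name": "AvailabilityZone", "Value": availability_zone})
    queries = [_query(name, stat, dims, period) for name, stat in NLB_METRICS.items()]
    if target_group:
        # UnHealthyHostCount is reported per target group, so it is requested
        # only when a TargetGroup dimension value is supplied.
        tg_dims = dims + [{"Name": "TargetGroup", "Value": target_group}]
        queries.append(_query("UnHealthyHostCount", "Maximum", tg_dims, period))
    return queries


def fetch_metrics(queries, start, end, region_name=None):
    """Call CloudWatch and return {query id: [(timestamp, value), ...]}."""
    import boto3  # imported lazily so build_queries works without the SDK

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=queries, StartTime=start, EndTime=end
    )
    return {
        result["Id"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```

The `load_balancer` value is the trailing portion of the NLB ARN (for example a placeholder such as `net/example-nlb/0123456789abcdef`), and `start` and `end` are `datetime` objects covering the incident window. Running `build_queries` once per AZ and comparing the results is one way to apply the AZ dimension mentioned in step 1.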

  2. If synthetic monitoring or a canary is available, you may be able to determine whether the issue is confined to a specific AZ or target. For example, if the anomalous end-to-end response times are observed only for a specific AZ or target, you can narrow the investigation to those items and reach a resolution faster.
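The per-AZ or per-target comparison in step 2 can be sketched as follows. This is only an illustration under stated assumptions: canary results are assumed to be available as `(group, latency_ms)` pairs, where the group is an AZ name or target identifier, and the function name `slow_groups` and the `factor` threshold of 2.0 are hypothetical choices, not values recommended by AWS.

```python
from collections import defaultdict
from statistics import median


def slow_groups(samples, factor=2.0):
    """Return groups whose median latency exceeds `factor` times the
    median of the other groups' medians."""
    by_group = defaultdict(list)
    for group, latency_ms in samples:
        by_group[group].append(latency_ms)

    medians = {group: median(values) for group, values in by_group.items()}

    flagged = []
    for group, value in medians.items():
        # Compare each group against its peers only, so that one very slow
        # group does not raise the baseline it is measured against.
        others = [m for g, m in medians.items() if g != group]
        if others and value > factor * median(others):
            flagged.append(group)
    return sorted(flagged)


# Hypothetical canary samples: (AZ, end-to-end response time in ms).
samples = [
    ("us-east-1a", 20), ("us-east-1a", 22),
    ("us-east-1b", 21), ("us-east-1b", 19),
    ("us-east-1c", 180), ("us-east-1c", 210),
]
print(slow_groups(samples))
```

With the sample data above, only the third AZ stands out, which would point the investigation at the NLB node and targets in that AZ.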

  3. If a support case is necessary, the following information can help expedite resolution:

    • Architecture diagram for the impaired application(s) with the traffic flow
    • Client Info (ARN/IP Address/EC2 IDs/ENI)
    • NLB Info (ARN/FQDN)
    • Target Info (ARN/IP Address/ENI)
    • Upstream components (ALBs/FQDNs/IP addresses)
    • Whether the impaired application is an existing deployment or a newly deployed application
  4. Collect VPC Flow Logs from the impaired source and target resources of the impaired application. Note that for NLB, access logs are created only if the load balancer has a TLS listener. Additionally, it is helpful to provide application logs that contain the following values, because they help narrow down the root cause:

    • $request_time
    • $upstream_connect_time
    • $upstream_header_time
    • $upstream_response_time
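The four values in step 4 match the names of NGINX timing variables. As a rough sketch of how they can narrow down a root cause, the snippet below parses one log line and labels where most of the request time went. It rests on several assumptions: the log uses a hypothetical `key=value` format, each request involves a single upstream attempt, the function names are illustrative, and the three-way split is a heuristic rather than an official interpretation of these fields.

```python
import re

FIELDS = (
    "request_time",
    "upstream_connect_time",
    "upstream_header_time",
    "upstream_response_time",
)
_PAIR = re.compile(r"(\w+)=(\S+)")


def _seconds(raw):
    """Convert a logged value to seconds; '-' or unparseable text becomes None."""
    try:
        return float(raw)
    except ValueError:
        return None


def parse_timings(line):
    """Extract the four timing fields from a key=value log line."""
    pairs = dict(_PAIR.findall(line))
    return {name: _seconds(pairs[name]) if name in pairs else None for name in FIELDS}


def dominant_phase(timings):
    """Label the phase that accounts for most of the request time."""
    total = timings["request_time"]
    connect = timings["upstream_connect_time"]
    upstream = timings["upstream_response_time"]
    if total is None or connect is None or upstream is None:
        return "unknown"
    phases = {
        # Time to establish the connection to the upstream.
        "upstream_connect": connect,
        # Remaining upstream time after the connection was established.
        "upstream_processing": max(upstream - connect, 0.0),
        # Time not attributed to the upstream at all.
        "outside_upstream": max(total - upstream, 0.0),
    }
    return max(phases, key=phases.get)


# Hypothetical log line in the assumed key=value format.
line = (
    "status=200 request_time=1.250 upstream_connect_time=1.100 "
    "upstream_header_time=1.200 upstream_response_time=1.210"
)
print(dominant_phase(parse_timings(line)))
```

A request dominated by connection setup points toward the network path (security groups, the load balancer, or a target that is slow to accept connections), while one dominated by upstream processing points toward the target application itself.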
  5. If the results are still inconclusive, perform a packet capture (pcap) at the impaired source and target resources for the application. A coordinated pcap collection effort is expected from both parties, that is, AWS (at the host level) and the customer (at the instance level). At AWS, the assigned engineer must collect both ingress and egress packet captures at the host level where the NLB resides. The packet capture is most effective when performed while the issue is active.

AWS
EXPERT
published 7 months ago, 3934 views