How do I troubleshoot an increase in the TargetResponseTime metric for an Application Load Balancer?


I noticed an increase in the Application Load Balancer TargetResponseTime metric. How do I troubleshoot this issue?

Short description

TargetResponseTime is the time elapsed, in seconds, between when the request leaves the load balancer and when a response from the target is received. This is equivalent to the target_processing_time field in the Application Load Balancer access logs.
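To confirm when the increase started, you can pull the metric history with the AWS CLI. The following command is a sketch only; the LoadBalancer dimension value and the time range are placeholders that you must replace with your own values:

aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/my-load-balancer/50dc6c495c0c9188 \
    --statistics Average Maximum \
    --period 60 \
    --start-time 2022-02-01T03:00:00Z \
    --end-time 2022-02-01T04:00:00Z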

Possible causes of an increase in TargetResponseTime include:

  • The hosts are unhealthy.
  • The backend instances are overwhelmed by too many requests.
  • There's high CPU utilization on the backend instances.
  • One or more of the targets is faulty.
  • There are issues with web application dependencies running on backend instances.

Resolution

The hosts are unhealthy

Verify that all the Application Load Balancer targets are healthy. For more information, see How do I troubleshoot and fix failing health checks for Application Load Balancers?
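You can also check target health from the command line. The following example is a sketch; replace the target group ARN with your own:

aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
    --output table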

The backend instances are overwhelmed by too many requests

Check the Sum statistic of the Amazon CloudWatch RequestCount and ActiveConnectionCount metrics for your Application Load Balancer. A sum increase that coincides with the increase in TargetResponseTime can indicate that the backend instances are overwhelmed by the request load.
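For example, you can retrieve the Sum of RequestCount over the affected period with a query similar to the following (the LoadBalancer dimension value and time range are placeholders). Run the same query with --metric-name ActiveConnectionCount to check the connection count:

aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RequestCount \
    --dimensions Name=LoadBalancer,Value=app/my-load-balancer/50dc6c495c0c9188 \
    --statistics Sum \
    --period 300 \
    --start-time 2022-02-01T02:00:00Z \
    --end-time 2022-02-01T04:00:00Z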

To resolve this issue, configure an Auto Scaling group for your backend instances. For more information, see Tutorial: Set up a scaled and load-balanced application.
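If the target group is already attached to an Auto Scaling group, one option is a target tracking scaling policy that scales on the request count per target. The following is a sketch only; the group name, target value, and resource label are placeholders:

aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-backend-asg \
    --policy-name alb-request-count-per-target \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/my-load-balancer/50dc6c495c0c9188/targetgroup/my-targets/73e2d6bc24d8a067"
        }
    }'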

There's high CPU utilization on the backend instances

Check the CloudWatch CPUUtilization metric of your backend instances. If CPU utilization is high, or there is a spike in utilization, upgrade your instances to a larger instance type.
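You can run the same kind of CloudWatch query against the AWS/EC2 namespace for each backend instance (the instance ID and time range are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistics Average Maximum \
    --period 300 \
    --start-time 2022-02-01T02:00:00Z \
    --end-time 2022-02-01T04:00:00Z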

One or more of the targets is faulty

If you're experiencing faulty targets, then follow these steps to resolve the issue:

1.    Enable access logging for your load balancer.

2.    Download the access logs for the time range when TargetResponseTime is high. For example, to download the access logs between 2022-02-01T03:00 and 2022-02-01T03:35 using the AWS Command Line Interface (AWS CLI), run the following command:

aws s3 cp s3://bucket-name[/prefix]/AWSLogs/aws-account-id/elasticloadbalancing/region/2022/02/01/ ./alblogs --recursive --exclude "*" --include "*20220201T03[0123]*"

Note: Replace bucket-name with your bucket's name, aws-account-id with your AWS account ID, and region with the AWS Region of your load balancer.

3.    Use command line tools to analyze the access logs:

Elastic Load Balancing (ELB) access logs are compressed in gzip (.gz) format.

Optional step: To extract the logs, use the following command:

$ gzip -dr ./alblogs

Example scenarios

To get the maximum latency for target_processing_time, run one of the following commands:

Compressed log file:

$zcat *.log.gz | awk '$7 != -1' | awk 'BEGIN{a=0}{if ($7>0+a) a=$7} END{print a}'

Uncompressed log file:

$cat *.log | awk '$7 != -1' | awk 'BEGIN{a=0}{if ($7>0+a) a=$7} END{print a}'

-or-

To count the number of requests per target with a target_processing_time of N seconds or more, replace N in the following commands with the threshold, in seconds, for your requirements.

Example command for compressed log file:

$zcat *.log.gz | awk '{if($7 >= N){print $5}}' | sort | uniq -c

Example command for uncompressed log file:

$cat *.log | awk '{if($7 >= N){print $5}}' | sort | uniq -c

Example output:

12 10.10.20.111:80 
12 10.10.60.163:80
254 10.10.70.7:80
6 10.10.80.109:80
20656 10.3.19.141:80

In the preceding example, the target with IP address 10.3.19.141 accounts for most of the increase in TargetResponseTime. In this case, check the Operating System (OS) and web application for the target.

There are issues with web application dependencies running on backend instances

Run a packet capture on the target to identify the delay in the target's response. For Linux, use tcpdump.

To capture a complete incoming and outgoing HTTP POST transmission, including the HTTP request and response, on port TCP/80:

tcpdump -i any -ns 0 -A 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x48545450 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x3C21444F'

To capture a complete incoming and outgoing HTTP GET transmission, including the HTTP request and response, on port TCP/80:

tcpdump -i any -ns 0 -A 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x48545450 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x3C21444F'

Example outputs:

14:04:12.186593 IP 10.10.30.219.13000 > 10.10.60.10.http: Flags [P.], seq 641705534:641705793, ack 1587610435, win 106, options [nop,nop,TS val 1165674323 ecr 263805247],
    length 259: HTTP: GET / HTTP/1.1 E..7."@...I. .. < 2..P&?.>^..C...j9...... Ez.S..Y?GET / HTTP/1.1 X-Forwarded-For: 54.66.76.204 X-Forwarded-Proto: http X-Forwarded-Port: 80 Host: labalbinternet-1236602672.ap-southeast-2.elb.amazonaws.com
    X-Amzn-Trace-Id: Root=1-6254355c-15db4904726649b66a1e47d7 User-Agent: curl/7.79.1 Accept: */* ................
14:04:21.187892 IP 10.10.60.10.http > 10.10.30.219.13000: Flags [P.], seq 1:592, ack 259, win 488, options [nop,nop,TS val 263814250
    ecr 1165674323], length 591: HTTP: HTTP/1.1 200 OK E...\.@.@.l. < ...P2.^..C&?.A....qn..... ..|jEz.SHTTP/1.1 200 OK Date: Mon, 11 Apr 2022 14:04:12 GMT Server: Apache/2.4.52 () OpenSSL/1.0.2k-fips X-Powered-By: PHP/7.2.34 Upgrade: h2,h2c
    Connection: Upgrade Transfer-Encoding: chunked Content-Type: text/html; charset=UTF-8 159 PHP file name: /index.php<br> ................

Note: In the preceding example outputs, the HTTP GET request arrives at 14:04:12 and the target responds at 14:04:21, so TargetResponseTime is approximately 9 seconds. You can use the X-Amzn-Trace-Id: Root value to trace this record in the access logs.

Example command for compressed log file:

$zcat *.log.gz | awk '{if($20 ~ "1-6254355c-15db4904726649b66a1e47d7"){print $6,$7,$8 }}'

Example command for uncompressed log file:

$cat *.log | awk '{if($20 ~ "1-6254355c-15db4904726649b66a1e47d7"){print $6,$7,$8 }}'

Example output (request_processing_time, target_processing_time, and response_processing_time, in seconds):

0.008 9.002 0.000
