I have an Application Load Balancer configured with an ECS Fargate target group, with ALB access logs turned on. Occasionally, I see requests like this in the ALB access logs (queried through Athena):
# type time elb client_ip client_port target_ip target_port request_processing_time target_processing_time response_processing_time elb_status_code target_status_code received_bytes sent_bytes request_verb request_url request_proto user_agent ssl_cipher ssl_protocol target_group_arn trace_id domain_name chosen_cert_arn matched_rule_priority request_creation_time actions_executed redirect_url lambda_error_reason target_port_list target_status_code_list classification classification_reason day
1 h2 2024-04-02T18:27:26.661881Z app/prod-alb-api/ebd9f80c43eb056e [REDACTED-IP] 3500 [REDACTED-TARGET-IP] 8071 0.001 0.004 0.0 202 502 26160 62364 POST [REDACTED-URL] HTTP/2.0 [REDACTED-USER_AGENT] ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 [REDACTED-ARN] Root=1-660c4e0e-0caa78f12803086b243258d9 [REDACTED-DOMAIN] arn:aws:acm:eu-west-1:475615009988:certificate/bd66730b-26a9-4e13-b47a-a50041bb0614 10000 2024-04-02T18:27:26.638000Z waf,forward - - [REDACTED-IP]:8071 502 - - 2024/04/02
The unusual thing about this log entry is that the 'elb_status_code' is 202, but the 'target_status_code' is 502.
My internal application logs show that my service is returning an HTTP 202 in its response to the incoming load balancer request.
Even if my service were actually returning a 502, my understanding is that the ALB should never convert that into a 202. What seems to be happening is that the ALB is incorrectly swapping the target_status_code and elb_status_code fields (and is, for some reason, actually returning a 502 to the client in response to a 202 from the target group).
This is also incrementing the CloudWatch HTTPCode_Target_5XX_Count metric, which is triggering spurious alarms.
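For reference, this is roughly how I've been cross-checking that metric against the log timestamps (a minimal boto3 sketch; the TargetGroup dimension value below is a placeholder, not our real target group):

```python
# Minimal sketch: pull HTTPCode_Target_5XX_Count for the window around one of the
# suspect log entries and compare it with the number of 202/502 rows from Athena.
# The TargetGroup dimension value is a placeholder, not a real ARN suffix.
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/prod-alb-api/ebd9f80c43eb056e"},
        {"Name": "TargetGroup", "Value": "targetgroup/example-tg/0123456789abcdef"},  # placeholder
    ],
    StartTime=datetime(2024, 4, 2, 18, 0, tzinfo=timezone.utc),
    EndTime=datetime(2024, 4, 2, 19, 0, tzinfo=timezone.utc),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```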
The above log entry is representative of all of the log entries I'm seeing with this target_status_code/elb_status_code combination. All of them come from the same external client IP and user agent, which makes me suspect that this client is confusing the ALB with an unusual request. Unfortunately, I'm unable to reproduce this myself or to contact the client causing it.
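For context, this is approximately the Athena query I'm using to pull these rows, wrapped in a small boto3 sketch (the table, database, and results bucket names are placeholders, and the status-code column types depend on your table DDL):

```python
# Sketch of the Athena query used to pull every row with this status code combination,
# grouped by client and user agent. "alb_access_logs", "alb_logs", and the S3 output
# location are placeholders; adjust them to your own ALB access log table setup.
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Note: elb_status_code / target_status_code may be typed int or string depending on
# the table DDL, so the literals below may need casting or quoting adjusted.
query = """
SELECT client_ip, user_agent, count(*) AS hits
FROM alb_access_logs
WHERE elb_status_code = 202
  AND target_status_code = '502'
  AND day = '2024/04/02'
GROUP BY client_ip, user_agent
ORDER BY hits DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "alb_logs"},                              # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},      # placeholder bucket
)
```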
I just noticed from the access log you shared that the traffic is encrypted, so you won't be able to see the payload of the traffic between the client and the ALB. However, you will still be able to see the traffic between the ALB and your backend servers (assuming that leg is not encrypted as well), which should give you some insight into what is happening.
If I understand correctly, this would require mirroring all of our production traffic to a single EC2 instance and saving it all to disk with tcpdump (since the source IP address will always be the ALB's IP address). Are there any less resource-intensive approaches to debugging this?
You can filter the capture based on the X-Forwarded-For header value, which will contain the client IP; however, you will only see the request and not the response. If you can correlate that with the access logs, you can then try to replay the same request and hopefully recreate the issue to troubleshoot it further.
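As a rough sketch of that filtering step (assuming the capture was written to a pcap file with tcpdump and the ALB-to-target leg is plain-text HTTP/1.1; the file name and client IP below are placeholders):

```python
# Rough sketch: pull the requests for a given client out of a pcap by matching the
# X-Forwarded-For header in the HTTP payload. Only request packets will match, since
# the header is not present in responses. Assumes plain-text HTTP/1.1 between the
# ALB and the target; "capture.pcap" and CLIENT_IP are placeholders.
from scapy.all import rdpcap, TCP, Raw

CLIENT_IP = "203.0.113.10"  # placeholder for the client IP seen in the access logs

for pkt in rdpcap("capture.pcap"):
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if f"X-Forwarded-For: {CLIENT_IP}".encode() in payload:
            # Print the raw request so it can be reconstructed and replayed
            # against a test environment for further troubleshooting.
            print(payload.decode(errors="replace"))
            print("-" * 60)
```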