ALB: unexpected 504 timeouts.

1

Hello, we're running an application with the following setup:

ALB -> TargetGroup -> ECS service

The ECS service is running on Fargate and it is a web-server.

The ALB timeout is set to 20 seconds, and the Service's idle timeout is set to 120 seconds. We've confirmed that the ECS service keeps connections open for at least 120 seconds.

We've also configured request logging on this ALB and we've been querying it with Athena.

However, it seems like the ALB still issues 504s to some requests immediately, we regularly see log entries like these:

#	type	time	elb	client_ip	client_port	target_ip	target_port	request_processing_time	target_processing_time	response_processing_time	elb_status_code	target_status_code	received_bytes	sent_bytes	request_verb	request_url	request_proto	user_agent	ssl_cipher	ssl_protocol	target_group_arn	trace_id	domain_name	chosen_cert_arn	matched_rule_priority	request_creation_time	actions_executed	redirect_url	lambda_error_reason	target_port_list	target_status_code_list	classification	classification_reason
1	h2	2022-04-05T12:29:35.833859Z	app/ProdV2PublicApi/0a57bb243c2024e7	REDACTED	60232	10.42.17.243	18000	0.001	-1.0	-1.0	504	-	307	202	PUT	https://apibeta.centralapp.com:443/api/v2/inbox/table-booking/request-status?urn=urn%3Acentralapp%3Acompany%3A9211291968452165724	HTTP/2.0	Mozilla/5.0 (Windows NT 6.0; rv:52.0) Gecko/20100101 Firefox/52.0	ECDHE-RSA-AES128-GCM-SHA256	TLSv1.2	arn:aws:elasticloadbalancing:eu-west-1:899041400740:targetgroup/ProdV2-Quasar-ECS/3ee8cde1808fdb22	Root=1-624c35f3-5852eff9590d1fd5285bc6e7	apibeta.centralapp.com	arn:aws:acm:eu-west-1:899041400740:certificate/7f1122d9-2470-46d6-b28b-53a48c99bdd1	5	2022-04-05T12:28:35.831000Z	waf,forward	-	-	10.42.17.243:18000	-	-	-

From the backend logs, it seems to us that the request never really gets to the ECS service, or from the ECS service's standpoint, the connection gets prematurely terminated by the ALB.

For the premature termination, we see logs in our backend like these:

[CRITICAL] [05/Apr/2022:10:18:47 +0000] ["Quasar/###WARP###"] Exception on request:Just Request
    { requestMethod = "POST"
    , httpVersion = HTTP/1.1
    , rawPathInfo = "/api/v2/distributor/companies/d-sales/leads"
    , rawQueryString = "?id=centralapp_saas&page=0"
    , requestHeaders =
        [
            ( "X-Forwarded-For"
            , "217.111.215.151"
            )
        ,
            ( "x-amzn-tls-version"
            , "TLSv1.2"
            )
        ,
            ( "x-amzn-tls-cipher-suite"
            , "ECDHE-RSA-AES128-GCM-SHA256"
            )
        ,
            ( "X-Forwarded-Proto"
            , "https"
            )
        ,
            ( "X-Forwarded-Port"
            , "443"
            )
        ,
            ( "Host"
            , "apibeta.centralapp.com"
            )
        ,
            ( "X-Amzn-Trace-Id"
            , "Root=1-624c1784-223215ba597701b31758046f"
            )
        ,
            ( "Content-Length"
            , "40"
            )
        ,
            ( "sec-ch-ua"
            , "" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99""
            )
        ,
            ( "content-type"
            , "application/json; charset=UTF-8"
            )
        ,
            ( "sec-ch-ua-mobile"
            , "?0"
            )
        ,
            ( "user-agent"
            , "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"
            )
        ,
            ( "sec-ch-ua-platform"
            , ""macOS""
            )
        ,
            ( "accept"
            , "*/*"
            )
        ,
            ( "origin"
            , "https://beta.centralapp.com"
            )
        ,
            ( "sec-fetch-site"
            , "same-site"
            )
        ,
            ( "sec-fetch-mode"
            , "cors"
            )
        ,
            ( "sec-fetch-dest"
            , "empty"
            )
        ,
            ( "referer"
            , "https://beta.centralapp.com/"
            )
        ,
            ( "accept-encoding"
            , "gzip, deflate, br"
            )
        ,
            ( "accept-language"
            , "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7"
            )
        ,
            ( "x-amzn-waf-name"
            , "ACL_Public_Resources_Default"
            )
        ]
    , isSecure = False
    , remoteHost = 10.42.17.44:12692
    , pathInfo =
        [ "api"
        , "v2"
        , "distributor"
        , "companies"
        , "d-sales"
        , "leads"
        ]
    , queryString =
        [
            ( "id"
            , Just "centralapp_saas"
            )
        ,
            ( "page"
            , Just "0"
            )
        ]
    , requestBody = <IO ByteString>
    , vault = <Vault>
    , requestBodyLength = KnownLength 40
    , requestHeaderHost = Just "apibeta.centralapp.com"
    , requestHeaderRange = Nothing
    }
Exception was:Warp: Client closed connection prematurely

We've also tried setting things up without an ALB and we've not been able to reproduce this issue locally, so we're pretty suspicious of what the ALB is doing here.

What should be our next course of action on this issue?

We're also noticing that the ALB tries to keep some connections open longer than the timeout:

[DEBUG] [05/Apr/2022:12:58:32 +0000] ["Quasar/###WARP###"] Socket: 10.42.17.105:39070 was open for: 90.897913261s
[DEBUG] [05/Apr/2022:12:58:35 +0000] ["Quasar/###WARP###"] Socket: 10.42.2.83:56654 was open for: 107.734655922s
[DEBUG] [05/Apr/2022:12:58:38 +0000] ["Quasar/###WARP###"] Socket: 10.42.2.83:56552 was open for: 129.749085682s
[DEBUG] [05/Apr/2022:13:14:17 +0000] ["Quasar/###WARP###"] Socket: 10.42.17.105:41128 was open for: 625.691958838s

Why?

asked 2 years ago2494 views
1 Answer
0

ALB sends 504s for the following reasons[1]:

  • The load balancer failed to establish a connection to the target before the connection timeout expired (10 seconds).
  • The load balancer established a connection to the target but the target did not respond before the idle timeout period elapsed.
  • The network ACL for the subnet did not allow traffic from the targets to the load balancer nodes on the ephemeral ports (1024-65535).
  • The target returns a content-length header that is larger than the entity body. The load balancer timed out waiting for the missing bytes.
  • The target is a Lambda function and the Lambda service did not respond before the connection timeout expired.

In your case, it looks like the ALB was not able to establish the connection with your ECS service with in the connection timeout period of 10 seconds. target_processing_time and response_processing_time entries in your Access log confirms the same[2].

Capture VPC flow logs on the ENIs of ALB to find out the time difference between the request and response from ALB to ECS service.

References:

AWS
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions