Hello, we're running an application with the following setup:
ALB -> TargetGroup -> ECS service
The ECS service runs on Fargate and is a web server.
The ALB's idle timeout is set to 20 seconds, and the service's own idle timeout is set to 120 seconds. We've confirmed that the ECS service keeps connections open for at least 120 seconds.
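For reference, the service-side timeout is configured roughly like this (a minimal sketch, assuming Warp's setTimeout; the wrapper function is illustrative, the port is the one the target group forwards to, and this is not our exact startup code):

import Network.Wai (Application)
import Network.Wai.Handler.Warp (defaultSettings, runSettings, setPort, setTimeout)

-- Illustrative wrapper: the point is that Warp's idle timeout (120s) is
-- deliberately kept above the ALB's (20s).
runServer :: Application -> IO ()
runServer app = runSettings settings app
  where
    settings =
        setPort 18000      -- container port the target group forwards to
      $ setTimeout 120     -- Warp idle timeout, in seconds
      $ defaultSettings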
We've also enabled access logging on this ALB and have been querying the logs with Athena.
However, the ALB still issues 504s for some requests almost immediately; we regularly see access log entries like the following:
# type time elb client_ip client_port target_ip target_port request_processing_time target_processing_time response_processing_time elb_status_code target_status_code received_bytes sent_bytes request_verb request_url request_proto user_agent ssl_cipher ssl_protocol target_group_arn trace_id domain_name chosen_cert_arn matched_rule_priority request_creation_time actions_executed redirect_url lambda_error_reason target_port_list target_status_code_list classification classification_reason
1 h2 2022-04-05T12:29:35.833859Z app/ProdV2PublicApi/0a57bb243c2024e7 REDACTED 60232 10.42.17.243 18000 0.001 -1.0 -1.0 504 - 307 202 PUT https://apibeta.centralapp.com:443/api/v2/inbox/table-booking/request-status?urn=urn%3Acentralapp%3Acompany%3A9211291968452165724 HTTP/2.0 Mozilla/5.0 (Windows NT 6.0; rv:52.0) Gecko/20100101 Firefox/52.0 ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:899041400740:targetgroup/ProdV2-Quasar-ECS/3ee8cde1808fdb22 Root=1-624c35f3-5852eff9590d1fd5285bc6e7 apibeta.centralapp.com arn:aws:acm:eu-west-1:899041400740:certificate/7f1122d9-2470-46d6-b28b-53a48c99bdd1 5 2022-04-05T12:28:35.831000Z waf,forward - - 10.42.17.243:18000 - - -
From the backend logs, it looks like either the request never actually reaches the ECS service or, from the ECS service's standpoint, the ALB terminates the connection prematurely.
For the premature terminations, our backend logs entries like the following:
[CRITICAL] [05/Apr/2022:10:18:47 +0000] ["Quasar/###WARP###"] Exception on request:Just Request
{ requestMethod = "POST"
, httpVersion = HTTP/1.1
, rawPathInfo = "/api/v2/distributor/companies/d-sales/leads"
, rawQueryString = "?id=centralapp_saas&page=0"
, requestHeaders =
[
( "X-Forwarded-For"
, "217.111.215.151"
)
,
( "x-amzn-tls-version"
, "TLSv1.2"
)
,
( "x-amzn-tls-cipher-suite"
, "ECDHE-RSA-AES128-GCM-SHA256"
)
,
( "X-Forwarded-Proto"
, "https"
)
,
( "X-Forwarded-Port"
, "443"
)
,
( "Host"
, "apibeta.centralapp.com"
)
,
( "X-Amzn-Trace-Id"
, "Root=1-624c1784-223215ba597701b31758046f"
)
,
( "Content-Length"
, "40"
)
,
( "sec-ch-ua"
, "" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99""
)
,
( "content-type"
, "application/json; charset=UTF-8"
)
,
( "sec-ch-ua-mobile"
, "?0"
)
,
( "user-agent"
, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"
)
,
( "sec-ch-ua-platform"
, ""macOS""
)
,
( "accept"
, "*/*"
)
,
( "origin"
, "https://beta.centralapp.com"
)
,
( "sec-fetch-site"
, "same-site"
)
,
( "sec-fetch-mode"
, "cors"
)
,
( "sec-fetch-dest"
, "empty"
)
,
( "referer"
, "https://beta.centralapp.com/"
)
,
( "accept-encoding"
, "gzip, deflate, br"
)
,
( "accept-language"
, "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7"
)
,
( "x-amzn-waf-name"
, "ACL_Public_Resources_Default"
)
]
, isSecure = False
, remoteHost = 10.42.17.44:12692
, pathInfo =
[ "api"
, "v2"
, "distributor"
, "companies"
, "d-sales"
, "leads"
]
, queryString =
[
( "id"
, Just "centralapp_saas"
)
,
( "page"
, Just "0"
)
]
, requestBody = <IO ByteString>
, vault = <Vault>
, requestBodyLength = KnownLength 40
, requestHeaderHost = Just "apibeta.centralapp.com"
, requestHeaderRange = Nothing
}
Exception was:Warp: Client closed connection prematurely
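That entry is emitted from our exception hook, which is wired up roughly like this (a sketch, assuming Warp's setOnException; logCritical stands in for our real logger and the exact formatting is illustrative):

import Network.Wai.Handler.Warp (Settings, setOnException)

-- Illustrative stand-in for our real logger.
logCritical :: String -> IO ()
logCritical = putStrLn . ("[CRITICAL] " ++)

-- Warp hands this hook the (Maybe Request) and the exception; the
-- "Client closed connection prematurely" entries above surface here.
withExceptionLogging :: Settings -> Settings
withExceptionLogging = setOnException $ \mreq e ->
  logCritical ("Exception on request:" ++ show mreq ++ "\nException was:" ++ show e)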
We've also tried running the same setup without an ALB in front and haven't been able to reproduce the issue locally, so we're fairly suspicious of what the ALB is doing here.
What should be our next course of action on this issue?
We're also noticing that the ALB keeps some connections open much longer than its configured timeout:
[DEBUG] [05/Apr/2022:12:58:32 +0000] ["Quasar/###WARP###"] Socket: 10.42.17.105:39070 was open for: 90.897913261s
[DEBUG] [05/Apr/2022:12:58:35 +0000] ["Quasar/###WARP###"] Socket: 10.42.2.83:56654 was open for: 107.734655922s
[DEBUG] [05/Apr/2022:12:58:38 +0000] ["Quasar/###WARP###"] Socket: 10.42.2.83:56552 was open for: 129.749085682s
[DEBUG] [05/Apr/2022:13:14:17 +0000] ["Quasar/###WARP###"] Socket: 10.42.17.105:41128 was open for: 625.691958838s
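For context, we measure those durations ourselves, roughly like this (a sketch, assuming Warp's setOnOpen/setOnClose hooks; the map-based bookkeeping and the log format are illustrative, not our exact code):

import qualified Data.Map.Strict as Map
import Data.IORef (newIORef, atomicModifyIORef')
import Data.Time.Clock (UTCTime, getCurrentTime, diffUTCTime)
import Network.Socket (SockAddr)
import Network.Wai.Handler.Warp (Settings, setOnOpen, setOnClose)

-- Remember when each peer socket was accepted and, when it closes, log how
-- long it stayed open; returns a Settings modifier to apply to Warp.
socketTiming :: IO (Settings -> Settings)
socketTiming = do
  opens <- newIORef (Map.empty :: Map.Map SockAddr UTCTime)
  let onOpen addr = do
        now <- getCurrentTime
        atomicModifyIORef' opens (\m -> (Map.insert addr now m, ()))
        pure True                              -- True = accept the connection
      onClose addr = do
        now <- getCurrentTime
        mStart <- atomicModifyIORef' opens (\m -> (Map.delete addr m, Map.lookup addr m))
        case mStart of
          Just start -> putStrLn $
            "[DEBUG] Socket: " ++ show addr ++ " was open for: " ++ show (diffUTCTime now start)
          Nothing    -> pure ()
  pure (setOnOpen onOpen . setOnClose onClose)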
Why would the ALB hold these connections open that long?