API Gateway to NLB via VPCLink latency issues


Hi there,

I've been working on linking API Gateway to our ECS services for the last few months, and as I near completion I've noticed that requests through to our ECS containers are taking a really long time (around 10000 ms). I've been testing this with a simple GET request into our containers, and the latency seems to occur on the request to the Network Load Balancer.

Execution log for request c4cdc901-4641-11e9-ae0f-7ba6f11e24e4
Thu Mar 14 10:13:10 UTC 2019 : Starting execution for request: c4cdc901-4641-11e9-ae0f-7ba6f11e24e4
Thu Mar 14 10:13:10 UTC 2019 : HTTP Method: GET, Resource Path: /app1/heartbeat
Thu Mar 14 10:13:10 UTC 2019 : Method request path: {proxy=heartbeat}
Thu Mar 14 10:13:10 UTC 2019 : Method request query string: {}
Thu Mar 14 10:13:10 UTC 2019 : Method request headers: {}
Thu Mar 14 10:13:10 UTC 2019 : Method request body before transformations:
Thu Mar 14 10:13:10 UTC 2019 : Endpoint request URI: http://nlb-bc8abe52e89e8e17.elb.eu-west-1.amazonaws.com:9004/heartbeat
Thu Mar 14 10:13:10 UTC 2019 : Endpoint request headers: {x-amzn-apigateway-api-id=<redacted>, User-Agent=AmazonAPIGateway_<redacted>, Host=nlb<redacted>.elb.eu-west-1.amazonaws.com}
Thu Mar 14 10:13:10 UTC 2019 : Endpoint request body after transformations:
Thu Mar 14 10:13:10 UTC 2019 : Sending request to http://nlb<redacted>.elb.eu-west-1.amazonaws.com:9004/heartbeat
Thu Mar 14 10:13:20 UTC 2019 : Received response. Integration latency: 10135 ms
Thu Mar 14 10:13:20 UTC 2019 : Endpoint response body before transformations: {"err":false,"output":"Alive"}
Thu Mar 14 10:13:20 UTC 2019 : Endpoint response headers: {X-Powered-By=Express, Access-Control-Allow-Origin=, Content-Type=application/json; charset=utf-8, Content-Length=30, ETag=W/"1e-3OO8BxIRzGH40uT/pZj6IlqId5s", Date=Thu, 14 Mar 2019 10:13:20 GMT, Connection=keep-alive}
Thu Mar 14 10:13:20 UTC 2019 : Method response body after transformations: {"err":false,"output":"Alive"}
Thu Mar 14 10:13:20 UTC 2019 : Method response headers: {X-Powered-By=Express, Access-Control-Allow-Origin=
, Content-Type=application/json; charset=utf-8, Content-Length=30, ETag=W/"1e-3OO8BxIRzGH40uT/pZj6IlqId5s", Date=Thu, 14 Mar 2019 10:13:20 GMT, Connection=keep-alive}
Thu Mar 14 10:13:20 UTC 2019 : Successfully completed execution
Thu Mar 14 10:13:20 UTC 2019 : Method completed with status: 200

When I attempt this request in the test section of API Gateway, this is the standard response, generally with a latency varying from 5 seconds to, at one point, 20 seconds. Occasionally I do get through quickly:

Execution log for request d8775b9d-4641-11e9-b83c-7fdec07590a2
Thu Mar 14 10:13:43 UTC 2019 : Starting execution for request: d8775b9d-4641-11e9-b83c-7fdec07590a2
Thu Mar 14 10:13:43 UTC 2019 : HTTP Method: GET, Resource Path: /app1/heartbeat
Thu Mar 14 10:13:43 UTC 2019 : Method request path: {proxy=heartbeat}
Thu Mar 14 10:13:43 UTC 2019 : Method request query string: {}
Thu Mar 14 10:13:43 UTC 2019 : Method request headers: {}
Thu Mar 14 10:13:43 UTC 2019 : Method request body before transformations:
Thu Mar 14 10:13:43 UTC 2019 : Endpoint request URI: http://nlb<redacted>.elb.eu-west-1.amazonaws.com:9004/heartbeat
Thu Mar 14 10:13:43 UTC 2019 : Endpoint request headers: {x-amzn-apigateway-api-id=<redacted>, User-Agent=AmazonAPIGateway_<redacted>, Host=nlb<redacted>.elb.eu-west-1.amazonaws.com}
Thu Mar 14 10:13:43 UTC 2019 : Endpoint request body after transformations:
Thu Mar 14 10:13:43 UTC 2019 : Sending request to http://nlb<redacted>.elb.eu-west-1.amazonaws.com:9004/heartbeat
Thu Mar 14 10:13:43 UTC 2019 : Received response. Integration latency: 10 ms
Thu Mar 14 10:13:43 UTC 2019 : Endpoint response body before transformations: {"err":false,"output":"Alive"}
Thu Mar 14 10:13:43 UTC 2019 : Endpoint response headers: {X-Powered-By=Express, Access-Control-Allow-Origin=, Content-Type=application/json; charset=utf-8, Content-Length=30, ETag=W/"1e-3OO8BxIRzGH40uT/pZj6IlqId5s", Date=Thu, 14 Mar 2019 10:13:43 GMT, Connection=keep-alive}
Thu Mar 14 10:13:43 UTC 2019 : Method response body after transformations: {"err":false,"output":"Alive"}
Thu Mar 14 10:13:43 UTC 2019 : Method response headers: {X-Powered-By=Express, Access-Control-Allow-Origin=
, Content-Type=application/json; charset=utf-8, Content-Length=30, ETag=W/"1e-3OO8BxIRzGH40uT/pZj6IlqId5s", Date=Thu, 14 Mar 2019 10:13:43 GMT, Connection=keep-alive}
Thu Mar 14 10:13:43 UTC 2019 : Successfully completed execution
Thu Mar 14 10:13:43 UTC 2019 : Method completed with status: 200

But these are few and far between. Requests to the ECS container directly are always under 100 ms and I can see no latency on its side. I have also had a look at the health checks, and the host has been consistently healthy since it started up. I also ran a couple of artillery tests against the endpoint and got the following results for 50 requests per second for one minute:

All virtual users finished
Summary report @ 10:50:10(+0000) 2019-03-14
Scenarios launched: 3000
Scenarios completed: 3000
Requests completed: 3000
RPS sent: 42.6
Request latency:
min: 227.9
max: 10990.1
median: 366.5
p95: 9892.2
p99: 10328.3
Scenario counts:
0: 3000 (100%)
Codes:
200: 3000

When I drop the frequency of requests down to 1 request per second for 5 minutes I see a large rise in the median request latency:

All virtual users finished
Summary report @ 10:39:13(+0000) 2019-03-14
Scenarios launched: 300
Scenarios completed: 300
Requests completed: 300
RPS sent: 0.97
Request latency:
min: 213.8
max: 10286.3
median: 4882.1
p95: 10077.8
p99: 10188.8
Scenario counts:
0: 300 (100%)
Codes:
200: 300

This rise in median makes me think that AWS is reusing resources between frequent requests, but I still see a large 10-second delay on occasional requests, which seems overly long to me. It makes the frontends feel clunky and under-optimised, and it's incredibly disheartening.
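One way to put a number on this from the execution logs is to pull out the "Integration latency" values and work out what share of requests are slow. The following is a minimal Python sketch, not something that was part of the tests above: the function names and the 5000 ms cut-off are assumptions chosen for illustration, and the sample text reuses only the two latency lines from the logs in this post.

```python
import re
from statistics import median

# Matches the latency line that API Gateway writes to its execution log.
LATENCY_RE = re.compile(r"Integration latency: (\d+) ms")


def integration_latencies(log_text):
    """Return every integration latency (in ms) found in the log text."""
    return [int(value) for value in LATENCY_RE.findall(log_text)]


def slow_share(latencies_ms, threshold_ms=5000):
    """Fraction of requests at or above threshold_ms (an assumed cut-off)."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for value in latencies_ms if value >= threshold_ms)
    return slow / len(latencies_ms)


# The two "Received response" lines quoted earlier in this post.
SAMPLE_LOG = """\
Thu Mar 14 10:13:20 UTC 2019 : Received response. Integration latency: 10135 ms
Thu Mar 14 10:13:43 UTC 2019 : Received response. Integration latency: 10 ms
"""

if __name__ == "__main__":
    latencies = integration_latencies(SAMPLE_LOG)
    print("latencies (ms):", latencies)
    print("median (ms):", median(latencies))
    print("share of slow requests:", slow_share(latencies))
```

A slow share close to one half would be consistent with the median in the 1 request per second run sitting between the fast and the slow group, while a smaller share would leave the median low and only push up p95 and p99.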

Has anyone ever come across issues like this? I have looked into it and I believe I can attribute around 500-1000 ms to the API Gateway Custom Domain Name distribution for reusing TLS connections. Apart from that, I am a little stumped as to what can account for 10 seconds per request.

asked 5 years ago, 1611 views
2 Answers

Hello:

Please see this forum thread for a common cause of the 5s-10s delay you are seeing:

https://forums.aws.amazon.com/thread.jspa?messageID=871957&#871957

Regards,
Bob

EXPERT
answered 5 years ago

Hi there Bob,

This was exactly the answer I needed. I've resolved the issue by enabling Cross Zone Load Balancing on the Network Load Balancer. My health checks seem fine, and I cannot reduce the zones that the NLB is provisioned in, as the target group it is targeting auto scales across all 3 Availability Zones. Whilst testing I only had 1 container running. I believe there is no way to adjust the health checks to tell API Gateway to only go to the AZ with a healthy target present? That would remove the need for Cross Zone Load Balancing. But with it enabled I'm consistently getting sub-30 ms latency.
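For anyone who wants to script the same change, here is a minimal sketch using boto3, assuming the setting is applied through the elbv2 modify_load_balancer_attributes call with the load_balancing.cross_zone.enabled attribute. The helper names and the ARN in the usage comment are hypothetical placeholders, and the eu-west-1 default is taken from the hostnames in the logs above. It is an illustration rather than the exact steps that were followed here.

```python
# Attribute key that controls cross-zone load balancing on the load balancer.
CROSS_ZONE_KEY = "load_balancing.cross_zone.enabled"


def cross_zone_request(load_balancer_arn, enabled=True):
    """Build the keyword arguments for modify_load_balancer_attributes."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Attributes": [
            {"Key": CROSS_ZONE_KEY, "Value": "true" if enabled else "false"}
        ],
    }


def enable_cross_zone(load_balancer_arn, region_name="eu-west-1"):
    """Turn on cross-zone load balancing for the given load balancer.

    boto3 is imported here so the request builder above can be used
    and inspected without AWS credentials or the SDK installed.
    """
    import boto3

    client = boto3.client("elbv2", region_name=region_name)
    return client.modify_load_balancer_attributes(
        **cross_zone_request(load_balancer_arn)
    )


# Usage (hypothetical ARN, replace with the ARN of your own NLB):
# enable_cross_zone("arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/net/example/0123456789abcdef")
```

Keeping the request payload in its own function makes it easy to check what will be sent before making the call.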

Thank you!!

answered 5 years ago
