Timeouts on reverse proxy after enabling DNS hostnames


We are running an nginx reverse proxy on an EC2 instance in a public subnet with a public IP address, and multiple external DNS records such as api.example.com (our app) and elastic.example.com (Kibana) point at it. Nginx matches requests by server_name and proxy_passes them to various private IP addresses on private subnets. This all works fine.
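
For reference, the relevant nginx configuration looks roughly like this (a sketch; the certificate paths are placeholders, and the upstream IP is the Kibana host that appears in the logs below):

```nginx
server {
    listen 443 ssl;
    server_name elastic.example.com;

    ssl_certificate     /etc/nginx/ssl/example.com.crt;  # placeholder path
    ssl_certificate_key /etc/nginx/ssl/example.com.key;  # placeholder path

    location / {
        # Kibana host on a private subnet, addressed by IP, not hostname
        proxy_pass http://10.0.1.137:5601;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```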

Yesterday, we turned on the “Enable DNS hostnames” setting on the VPC containing all of our EC2 instances. We also created a private Route 53 hosted zone and added a record so that an Elasticsearch cluster could be reached internally as “elastic.example.internal”, sparing us from maintaining the Elasticsearch hosts' IP addresses on every instance that uses the service. This internal access also worked fine. After around 24-48 hours, however, requests to api.example.com and elastic.example.com started failing intermittently with gateway timeout errors. Browsing became extremely slow, with frequent timeouts. The nginx access.log showed these requests returning 503, and the error.log showed:

2024/02/08 10:43:04 [error] 417483#417483: *120 upstream timed out (110: Connection timed out) while connecting to upstream, client: 176.132.61.88, server: elastic.example.com, request: "GET /70088/bundles/kbn-ui-shared-deps-src/kbn-ui-shared-deps-src.css HTTP/1.1", upstream: "http://10.0.1.137:5601/70088/bundles/kbn-ui-shared-deps-src/kbn-ui-shared-deps-src.css", host: "elastic.example.com", referrer: "https://elastic.example.com/login?next=%2F"

We tried flushing the systemd-resolved cache and restarting nginx and Kibana, none of which helped. Disabling “Enable DNS hostnames” resolved the issue shortly afterwards. Why did this issue occur when nginx was configured to pass requests to these hosts by IP address, not hostname? Was the internal hosted zone somehow conflicting with the Amazon-provided private DNS hostnames? Could it be because the EC2 instances already existed when the setting was enabled? How can we enable DNS hostnames without causing timeouts?
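
In case it helps, this is how we verified the VPC DNS attributes while toggling the setting (the VPC ID below is a placeholder):

```bash
# Both attributes need checking: enableDnsHostnames and enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsHostnames
aws ec2 describe-vpc-attribute --vpc-id vpc-0123456789abcdef0 --attribute enableDnsSupport
```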

3 Answers
Accepted Answer

This problem was eventually traced to net.netfilter.nf_conntrack_max being set too low on the reverse proxy: the kernel's connection-tracking table was filling up, so netfilter silently dropped new connections, which nginx surfaced as intermittent upstream timeouts. The problem was identified by inspecting the dmesg logs, which showed netfilter dropping packets. Running sudo sysctl -w net.netfilter.nf_conntrack_max=131072 resolved the issue.
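
For anyone hitting the same symptoms, this is roughly how to confirm and fix it (131072 is the value that worked for us; size the limit to your own traffic):

```bash
# The kernel logs drops when the connection-tracking table is full,
# e.g. "nf_conntrack: table full, dropping packet"
dmesg | grep -i conntrack

# Compare current usage against the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit immediately...
sudo sysctl -w net.netfilter.nf_conntrack_max=131072

# ...and persist it across reboots
echo 'net.netfilter.nf_conntrack_max = 131072' | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system
```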

strophy
answered 2 months ago

What private zone did you create? Was it a private hosted zone (PHZ) for an amazonaws.com domain?

EXPERT
answered 3 months ago

The private hosted zone was named example.internal, where example is our company name. It contained the automatically generated SOA and NS records, plus two A records named elastic.example.internal that we created with a multivalue answer routing policy.
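
The records can be listed like this (the hosted zone ID below is a placeholder):

```bash
aws route53 list-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --query "ResourceRecordSets[?Name=='elastic.example.internal.']"
```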

strophy
answered 3 months ago
