
Intermittent name resolution error when accessing RDS db


At irregular intervals (every couple of days, weeks, or even months), the web application hosted on our AWS EC2 instance cannot connect to our RDS database because of a name resolution error: "SQLSTATE[HY000] php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution"

Rebooting the EC2 instance makes the problem go away for a while, but it returns days, weeks, or months later. I would love to know how to prevent it from recurring.

3 Answers

How many connections per second are being made from the EC2 instance to the RDS instance? You may be running into a DNS quota issue where the single EC2 instance is sending too many packets too quickly to resolve the RDS DNS entry.

Info about this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits
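If the failures do turn out to be transient DNS throttling, one common mitigation is to retry the lookup with a short backoff instead of failing the request on the first error. The following is only a minimal sketch of that idea in Python (the application in the question is PHP, so treat it as an illustration rather than a drop-in fix). The function name, the retry count, and the delays are assumptions chosen for the example, not values taken from AWS documentation.

```python
import socket
import time


def resolve_with_backoff(host, port, attempts=5, base_delay=0.2,
                         resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying failed lookups with exponential backoff.

    `resolver` and `sleep` are injectable so the retry logic can be
    exercised without touching the network (hypothetical test hooks).
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            # gaierror is what Python raises when getaddrinfo fails,
            # including "Temporary failure in name resolution".
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * (2 ** attempt))  # 0.2s, 0.4s, 0.8s, ...
```

Retrying hides brief throttling but does not reduce the number of DNS packets sent, so it is worth pairing with some form of caching or connection reuse if the query rate really is the cause.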

answered 2 months ago

Interesting. I am also seeing very occasional (but transitory) rejections from an RDS instance that is only contacted once per minute! The Airflow log looks like this:

2022-03-12 03:05:03,760 {{bash.py:173}} INFO - psql: could not connect to server: Connection refused
2022-03-12 03:05:03,952 {{bash.py:173}} INFO - Is the server running on host "edmdbc01.crqgpecn5ugy.ap-southeast-2.redshift.amazonaws.com" (10.160.224.71) and accepting
2022-03-12 03:05:04,071 {{bash.py:173}} INFO - TCP/IP connections on port 5432?
2022-03-12 03:05:04,148 {{bash.py:173}} INFO - AWS Redshift connection is broken !
2022-03-12 03:05:04,184 {{bash.py:173}} INFO - PGHOST=edmdbc01.crqgpecn5ugy.ap-southeast-2.redshift.amazonaws.com
2022-03-12 03:05:04,217 {{bash.py:173}} INFO - PGPORT=5432
2022-03-12 03:05:04,249 {{bash.py:173}} INFO - PGDATABASE=staging
2022-03-12 03:05:04,286 {{bash.py:173}} INFO - PGUSER=airflow_user
2022-03-12 03:05:04,320 {{bash.py:173}} INFO - PGPASS=/tmp/airflowtmp868/dev/.pgpass
2022-03-12 03:05:04,356 {{bash.py:173}} INFO - PGOPTIONS=--search_path=model

Route 53 can rate-limit bursts, but I would (naively?) assume that if the web app keeps hitting the same RDS endpoint and the DNS TTL isn't 0 (caching disabled), the name cache on the EC2 instance would prevent repeat lookups until the entry expires. Perhaps the web app is resolving lots of other, novel names and the RDS lookup is just the unlucky 1025th (the documented limit is 1024 packets per second per network interface)?
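The caching behaviour assumed above can be sketched as a small TTL cache in front of the resolver. This is an illustration only: whether an EC2 instance caches lookups at all depends on whether a local caching resolver (for example dnsmasq or systemd-resolved) is running, since the stock glibc resolver does not cache on its own. The class name and the `ttl_seconds` value are hypothetical placeholders, not the TTL that RDS actually publishes.

```python
import socket
import time


class CachingResolver:
    """Cache lookup results per (host, port) until a fixed TTL expires."""

    def __init__(self, ttl_seconds=5.0, resolver=socket.getaddrinfo,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver   # injectable for testing
        self._clock = clock         # injectable for testing
        self._cache = {}            # (host, port) -> (expires_at, result)
        self.lookups = 0            # number of real lookups performed

    def resolve(self, host, port):
        key = (host, port)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: no DNS packet is sent
        self.lookups += 1
        result = self._resolver(host, port)
        self._cache[key] = (now + self._ttl, result)
        return result
```

With a cache like this, repeated connections to the same endpoint cost one lookup per TTL window, so only lookups for other names would count against the per-interface limit.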

answered 2 months ago

For something completely different ... I have just noticed a pattern emerging in our own Redshift connection failures!

All of them occurred between 03:01 UTC and 03:20 UTC. In most cases service was restored after one minute, but on two occasions it took as long as 4 minutes!

This suggests that routine (weekly?) maintenance is in progress and the instance is being killed and restarted, or at least going into a single-user mode and hence refusing connections!

In that case, rebooting the EC2 instance probably just introduces enough delay that the Redshift maintenance is complete by the time the instance is back online, which in turn means the reboot is not actually fixing anything. What happens if you leave it online? Does it ever reconnect without intervention?
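One way to answer that question is to leave the instance running and log a periodic probe that records which step fails. Note that the two symptoms in this thread are different: the original question reports a name resolution failure, while the Redshift log shows "Connection refused", which happens after the name has already resolved. The sketch below, with a hypothetical function name and result labels, separates the two cases; it is an assumption about how such a probe might look, not an AWS-provided tool.

```python
import socket


def probe(host, port, timeout=3.0, resolver=socket.getaddrinfo,
          connect=socket.create_connection):
    """Classify one connection attempt.

    Returns 'dns_failure', 'refused', 'timeout', 'other_error', or 'ok'.
    `resolver` and `connect` are injectable so the logic can be tested
    without a network.
    """
    try:
        resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"        # the name could not be resolved
    try:
        conn = connect((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"            # resolved, but nothing accepted the connection
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError:
        return "other_error"
    conn.close()
    return "ok"
```

Running this once a minute and logging the result with a timestamp would show whether failures are DNS-related or refusals, whether they cluster in a fixed time window, and whether the connection recovers without a reboot.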

answered 2 months ago
