Intermittent name resolution error when accessing RDS db

At varying intervals (every couple of days, weeks, or even months), the web application hosted on our AWS EC2 instance cannot connect to our RDS database due to a name resolution error: "SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution"

Rebooting the EC2 instance makes the problem go away for a while, but it returns days, weeks, or months later. Would love to know how to prevent the problem from recurring.

4 Answers
How many connections per second are being made from the EC2 instance to the RDS instance? You may be running into a DNS quota issue where the single EC2 instance is sending too many packets too quickly to resolve the RDS DNS entry.

Info about this: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits

Chris_G
answered 2 years ago
Interesting - I am also seeing very occasional (but transitory) rejections from an RDS instance that is only contacted once per minute! The Airflow log looks like ...

[2022-03-12 03:05:03,760] {{bash.py:173}} INFO - psql: could not connect to server: Connection refused
[2022-03-12 03:05:03,952] {{bash.py:173}} INFO - Is the server running on host "edmdbc01.crqgpecn5ugy.ap-southeast-2.redshift.amazonaws.com" (10.160.224.71) and accepting
[2022-03-12 03:05:04,071] {{bash.py:173}} INFO - TCP/IP connections on port 5432?
[2022-03-12 03:05:04,148] {{bash.py:173}} INFO - AWS Redshift connection is broken !
[2022-03-12 03:05:04,184] {{bash.py:173}} INFO - PGHOST=edmdbc01.crqgpecn5ugy.ap-southeast-2.redshift.amazonaws.com
[2022-03-12 03:05:04,217] {{bash.py:173}} INFO - PGPORT=5432
[2022-03-12 03:05:04,249] {{bash.py:173}} INFO - PGDATABASE=staging
[2022-03-12 03:05:04,286] {{bash.py:173}} INFO - PGUSER=airflow_user
[2022-03-12 03:05:04,320] {{bash.py:173}} INFO - PGPASS=/tmp/airflowtmp868/dev/.pgpass
[2022-03-12 03:05:04,356] {{bash.py:173}} INFO - PGOPTIONS=--search_path=model

Whilst Route 53 can rate-limit bursts, I would (naively?) assume that if the webapp was hitting the same endpoint (RDS) and the DNS TTL wasn't 0 (caching disabled), the EC2 name cache would prevent the name from being resolved again until the cached entry expired. Perhaps the webapp is requesting lots of other novel names and the RDS request is just the unlucky 1001st?

answered 2 years ago
For something completely different ... I have just noticed a pattern emerging in our own Redshift connection failures!

These have all occurred between 03:01 UTC and 03:20 UTC. In the majority of cases service was restored after one minute, but on two occasions it took as long as 4 minutes!

This suggests (weekly?) routine maintenance is in progress and the instance is being killed/restarted - or at least going into a single user mode and hence refusing connections!

In that case, rebooting the EC2 instance probably introduces enough delay that by the time it is back online the Redshift maintenance is complete, which in turn means the reboot is not actually fixing anything. What happens if you leave it online? Does it ever reconnect without intervention?
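If the failures really are transient, one way to test that without rebooting is to retry the lookup with a growing delay before giving up. The sketch below is only an illustration: the original application is PHP, Python is used here for brevity, and the function name `resolve_with_retry`, the retry count, and the delays are all assumptions rather than anything taken from the setups described above.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, base_delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying resolution failures with exponential backoff.

    The resolver and sleep parameters are hypothetical hooks that make
    the function easy to exercise without touching the network.
    """
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror:
            # Re-raise once the retry budget is spent.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

With the defaults this waits 1, 2, 4, and 8 seconds between the five attempts, so it would ride out a blip of a few seconds but not a maintenance window lasting several minutes; the budget would need tuning to whatever outage length you actually observe.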

answered 2 years ago
As Chris_G mentioned:

You may be running into a DNS quota issue where the single EC2 instance is sending too many packets too quickly to resolve the RDS DNS entry

I think it's worth adding that the hard limit on DNS queries is measured in packets per second, per ENI (elastic network interface). So while your webapp may not be querying the DB's DNS name that often, it may be just as you said:

Perhaps the webapp is requesting lots of other novel names and the RDS request is just the unlucky 1001st?

However, I don't believe the names would have to be novel. Unless you have configured local DNS caching on your EC2 instance, every DNS query from your instance still goes to the Amazon-provided DNS server, even for a name it resolved a moment ago.
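To make the caching point concrete, here is a rough in-process sketch of what a cache buys you: repeated lookups for the same name are answered locally until the entry expires. It is an assumption-laden illustration, not a recommendation over a proper local caching resolver. The class name `CachingResolver` and the fixed 30-second lifetime are made up for the example, and because `socket.getaddrinfo` does not expose the record's real TTL, the cache cannot honour it.

```python
import socket
import time


class CachingResolver:
    """Cache lookups in-process so repeated connects skip the DNS server."""

    def __init__(self, ttl_seconds=30.0, resolver=socket.getaddrinfo,
                 clock=time.monotonic):
        # ttl_seconds is an assumed fixed lifetime, not the record's TTL.
        self.ttl_seconds = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._cache = {}  # (host, port) -> (expiry_time, result)

    def resolve(self, host, port):
        now = self._clock()
        entry = self._cache.get((host, port))
        # Serve from cache while the entry has not yet expired.
        if entry is not None and entry[0] > now:
            return entry[1]
        # Otherwise ask the real resolver and remember the answer.
        result = self._resolver(host, port)
        self._cache[(host, port)] = (now + self.ttl_seconds, result)
        return result
```

Keep the lifetime short if you try something like this: the address behind an RDS endpoint can change (for example after a failover), and a long-lived cache entry would keep pointing at the old one.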

I ran into the same issue, but it was with Lambda functions instead of an EC2 instance. We solved the issue by increasing the number of ENIs assigned to the Lambda functions. You could also increase the number of ENIs on your EC2 instance, depending on which instance type you're using.

If you want to dive deeper into finding out exactly where the problem is, check out "How can I determine whether my DNS queries to the Amazon-provided DNS server are failing due to VPC DNS throttling?" and then "How can I avoid DNS resolution failures with an Amazon EC2 Linux instance?"
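As a quick first check before working through those articles, instances using the ENA network driver expose a `linklocal_allowance_exceeded` counter in `ethtool -S` output, which counts packets dropped because traffic to link-local services (DNS, but also instance metadata and time sync) exceeded the per-interface limit. The sketch below reads it; the function names and the `eth0` interface name are assumptions, and the counter only appears with an ENA driver version that reports it.

```python
import subprocess


def parse_linklocal_exceeded(ethtool_output):
    """Return the linklocal_allowance_exceeded counter, or None if absent."""
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "linklocal_allowance_exceeded":
            return int(value.strip())
    return None


def read_linklocal_exceeded(interface="eth0"):
    """Run `ethtool -S` on the interface and extract the counter.

    The interface name is an assumption; check `ip link` for yours.
    """
    output = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_linklocal_exceeded(output)
```

A value that keeps climbing between two readings suggests throttling on that interface, though since the counter is not DNS-specific it is a hint rather than proof.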

The following links might also be of use:

Also, here are some answers I posted regarding the same issue, but with Lambda functions

mikey
answered a year ago
