Good day.
I have a problem: after adding RDS Proxy to our service, "SQLSTATE[HY000] [2002] Connection timed out" errors started to appear. Although they are infrequent (maybe 1 in 100,000 connections fails, or even fewer), the problem quickly became annoying. Sometimes there are one or two per hour, sometimes none. The number of errors seems to correlate loosely with the number of connections made, though it is hard to tell.
Some info about the service:
- Aurora MySQL 3.03.1, 3AZ, 1 writer + 2 readers.
- PHP 8.1 + PDO extension (maybe this is a known issue that the advanced driver for Java solves?). The connection attempt is aborted after about 30 seconds.
MySQL non-default parameters:
- sql_mode = TRADITIONAL
- max_connections = 500 (was useful before the proxy was introduced)
- long_query_time = 30
- binlog_format = MIXED (for blue-green deployments only)
RDS Metrics:
- AbortedClients: tens per day, which is significantly more than the 2-5 errors I see.
- ActiveTransactions: 0
- AuroraSlowConnectionHandleCount: increases by 1.5k+ per day
- AuroraSlowHandshakeCount: increases by 2-3 per day
RDS Proxy metrics are all green (no errors, connection pins, etc.). I tried enabling advanced logging, but there was not much to gain from it. I could only see that some connections were pinned due to the query size, but that was relatively rare, a few per minute.
Thanks for any help.
If there are any folks that stumbled upon the same issue, feel free to complain here.
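While the root cause is unknown, one possible stopgap for rare, transient connection timeouts like these is to retry the connection attempt a few times with a short backoff instead of waiting for a single 30-second attempt to fail. The sketch below shows the general pattern in Python rather than PHP, purely as an illustration; `connect_with_retry` and all of its parameters are hypothetical names, and `connect` stands for whatever callable opens the database connection (in the PHP setup above that would be a wrapper around `new PDO(...)`, presumably with a shorter connect timeout). It does not address why the proxy times out in the first place.

```python
import random
import time


def connect_with_retry(connect, attempts=3, base_delay=0.2, max_delay=2.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are exhausted.

    All names and defaults here are illustrative assumptions, not values
    recommended by AWS or taken from the thread.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only errors listed in `retry_on` are retried, so authentication or SQL errors still fail immediately. The `sleep` parameter exists only so the waiting can be replaced in tests.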
We have also started seeing the same problem this morning.
Our production workload is impacted and the workaround was to stop using the proxy. However, as we run high data loads using Lambda overnight, this mitigation is very temporary.
I have been consulting with the RDS specialists from AWS business support for around 7 hours today and we have not got to the bottom of the problem yet.
Sup. If anyone is interested, I haven't seen those errors since then (no code or infrastructure changes, they just disappeared). I am not using Lambdas, though. If you were able to fix the issue, please respond here; I am just interested to hear what it was.
@jason @adam out of curiosity, what region are you seeing these issues in? I'm seeing them in ap-southeast-2.
@luke yes we're also ap-southeast-2, glad that it's not just us then! I haven't tried re-introducing the proxy this morning but please let us know if you get to the bottom of the issue or have any updates.
Since we can't use the proxy for now, we've had to switch to Aurora Serverless v2 in the meantime to avoid reaching the connection limit of our regular instances.
I am in us-west-2 and everything is fine, so it looks like you are onto something.