CRITICAL: Aurora Serverless V2 writer node failing


We run a large eCommerce site whose production database cluster was recently migrated from Aurora MySQL Serverless V1 to V2.

In the last week and a half, we've observed two identical, complete failures of the Serverless V2 writer node. All attempts to connect to the RDS node fail with the message "Too many connections". Yet prior to each incident, CloudWatch logs confirm ~30 connections (typical for that time of night), and as soon as the failure occurs, the connection count drops to 0 and remains there.

We use custom parameter groups that set max_execution_time to 20s, so any long-running SQL statements/connections will have ended and been closed, as confirmed by the CloudWatch logs and the audit logs we also export to CloudWatch.

During the outage, no auto-scaling is occurring and no auto-recovery happens. We have to manually reboot the Writer instance to get the cluster/website back online.

Hoping this will catch the eye of someone from AWS who can look into this ASAP.

1 Answer

Hello there,

As you mentioned, the Aurora Serverless v2 instance becomes unresponsive with the error "Too many connections", even though no connection limit appears to have been reached. Each database connection uses resources, namely CPU and memory. If those resources are exhausted, further database connections will not be possible. Possible causes of the failing connections include: CPU and/or memory overload (i.e. resource saturation); minimum and maximum ACU capacity set too low for your workload; and parameters that use too much memory, leading to resource saturation.

Please ensure that none of the above is affecting your Aurora Serverless v2 instance. The following resources may help you mitigate the issue:

Troubleshooting for Amazon Aurora - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Troubleshooting.html
Performance and scaling for Aurora Serverless v2 - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.setting-capacity.html
Aurora Serverless v2 blog - https://aws.amazon.com/blogs/aws/amazon-aurora-serverless-v2-is-generally-available-instant-scaling-for-demanding-workloads/

If memory does turn out to be the constraint, consider scaling the instance up to a DB instance class with more memory.
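One way to check the possible causes listed in this answer is to pull the relevant CloudWatch metrics for the window around the outage. The following is only a minimal sketch, not an official procedure. It assumes a boto3 CloudWatch client is passed in, and that the AWS/RDS metrics DatabaseConnections, ServerlessDatabaseCapacity, CPUUtilization and FreeableMemory are the ones of interest. The helper names and the instance identifier used in the note below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# AWS/RDS metric names relevant to the causes discussed above (assumed selection).
METRICS = (
    "DatabaseConnections",
    "ServerlessDatabaseCapacity",
    "CPUUtilization",
    "FreeableMemory",
)


def build_metric_request(instance_id, metric_name, end_time, hours=1, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Minimum", "Maximum"],
    }


def fetch_metrics(cloudwatch, instance_id, end_time, hours=1):
    """Return {metric_name: datapoints sorted by timestamp}.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    """
    results = {}
    for name in METRICS:
        response = cloudwatch.get_metric_statistics(
            **build_metric_request(instance_id, name, end_time, hours)
        )
        results[name] = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return results
```

With boto3 installed and credentials configured, calling fetch_metrics(boto3.client("cloudwatch"), "my-writer-instance", outage_time) would return the minimum and maximum of each metric per minute for the hour before outage_time, which makes it easier to see whether connections, capacity, CPU or memory changed first.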

answered 2 years ago
  • Hi,

    Thanks for the reply. I can confirm CPU certainly wasn't limited, and ACU peaked at 10.5 right before the outage (the limit is 128, so there was plenty of headroom). CloudWatch also reports 255G of freeable memory at the time, which feels like far more than I'd expect at 10.5 ACU. Could this be a skewed stat due to how Serverless v2 reports it?

    Either way, I certainly don't think it's CPU or Memory limited and the ACU had plenty of headroom, so I would have expected things to scale if there was a resource issue.

  • Hi, I would suggest opening a support case, if you have a support plan, so that AWS Support can help investigate from the AWS end.

    The error "Too many connections" typically means that max_connections has been reached; see https://dev.mysql.com/doc/refman/5.7/en/too-many-connections.html. What value do you get for max_connections when you run: SHOW VARIABLES LIKE "max_connections";

    Is there anything else in the MySQL error logs?

  • I don't like having to take out AWS Support and pay an extra 10% of our monthly bill just to report what appears to be a critical bug in Serverless V2, but after being affected again by the same issue today, we've now done so - Please see Case ID 10237776351. A swift investigation would be much appreciated!

    Also, SELECT @@max_connections; returns 5000.
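The max_connections check discussed in the comments can be taken one step further by comparing the limit with the live connection count. The following is a minimal sketch, assuming the two inputs come from SHOW GLOBAL STATUS LIKE 'Threads_connected'; and SELECT @@max_connections; respectively. The function name and the 80% warning threshold are illustrative choices, not values from AWS or MySQL.

```python
def connection_headroom(threads_connected, max_connections, warn_ratio=0.8):
    """Summarise how close the current connection count is to max_connections.

    threads_connected: value of the Threads_connected status variable.
    max_connections:   value of the max_connections system variable.
    warn_ratio:        illustrative threshold for flagging "near the limit".
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    used_ratio = threads_connected / max_connections
    return {
        "free_slots": max_connections - threads_connected,
        "used_ratio": used_ratio,
        "near_limit": used_ratio >= warn_ratio,
    }
```

With the figures reported in this thread (about 30 connections against a limit of 5000), near_limit is False and free_slots is 4970, which is consistent with the view that the configured limit itself was not reached.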
