Aurora Serverless V2 (PostgreSQL) gets stuck
We have a PostgreSQL Aurora Serverless V2 cluster with two instances (one writer and one reader). Both are configured to scale between 2 and 16 ACUs.
When we populate data into the DB, one or both instances get stuck. PostgreSQL connections time out without any error message. A stuck instance looks healthy in the AWS console, but AWS stops receiving metrics from it. Only a reboot fixes it, and only temporarily; the reboot itself takes a long time (30-60 min).
The error log contains the following line, repeated every 5-10 seconds:
2022-06-20 06:01:06 UTC::@::WARNING: worker took too long to start; canceled
2022-06-20 06:01:11 UTC::@::WARNING: worker took too long to start; canceled
We checked all instance metrics from before the freeze but saw nothing interesting.
The load was quite heavy when we triggered the freezes. In the first run we used 1,000 Lambdas concurrently to insert data; it worked fine until the writer instance got stuck. Later we used only 100 Lambdas in parallel; that time the writer had no issues, but the reader instance got stuck.
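For reference, each Lambda in our test inserted rows in batches rather than one statement per row, roughly as sketched below. This is a simplified stand-in, not the original code; the table name, column names, and batch size are placeholders.

```python
# Hypothetical sketch of how each Lambda batched its inserts.
# Table/column names and the batch size are assumptions, not the real code.

def chunked(rows, size):
    """Split rows into batches of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def build_insert(table, columns, batch):
    """Build one multi-row INSERT statement with %s placeholders."""
    placeholders = ", ".join(
        "(" + ", ".join(["%s"] * len(columns)) + ")" for _ in batch
    )
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {placeholders}"
    params = [value for row in batch for value in row]
    return sql, params

rows = [(i, f"name-{i}") for i in range(10)]
for batch in chunked(rows, 4):
    sql, params = build_insert("items", ["id", "name"], batch)
    # cursor.execute(sql, params)  # executed against the cluster in practice
```

Batching like this keeps the statement count low, but it did not prevent the freezes.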
We chose the new Aurora Serverless V2 as our production database. Currently it feels too unstable, and we are considering migrating to a more mature service.
Did you get any resolution to this? I am experiencing the exact same problem: no indication of anything going off the rails (memory and CPU all fine), but a completely unresponsive writer and "WARNING: worker took too long to start; canceled" over and over in the logs. This seems like an AWS Aurora issue that we shouldn't have to pay AWS support to investigate, since we are already paying for the service and there is clearly something wrong with it.
Can you please post the ACU value it had scaled to just before the freeze? It may be that 16 ACUs is not enough for your workload and the writer is getting pegged at 100% CPU utilization. Have you tried a higher min and max ACU setting, e.g. 4-32, to see what the results are? Since you are still in the testing phase, set a higher max ACU and check what ACU value V2 scales to under your workload.
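Raising the capacity range can be done in place with the AWS CLI; something like the following, where `my-cluster` is a placeholder for your cluster identifier:

```shell
# Raise the Serverless v2 scaling range to 4-32 ACUs for testing.
aws rds modify-db-cluster \
  --db-cluster-identifier my-cluster \
  --serverless-v2-scaling-configuration MinCapacity=4,MaxCapacity=32 \
  --apply-immediately
```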
Open a ticket with support as well.
ACU usage before the latest freeze of the reader instance was 10, with CPU usage at 5-10%. Those numbers were quite stable for the 6 hours before the issue. Before that, there had been 100% CPU usage and 16 ACUs for 20 hours.
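For anyone wanting to check the same thing, the ACU history is exposed as the `ServerlessDatabaseCapacity` CloudWatch metric; a query along these lines should work (instance identifier and time window are placeholders):

```shell
# Pull the ACU metric for the hours leading up to the freeze.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ServerlessDatabaseCapacity \
  --dimensions Name=DBInstanceIdentifier,Value=my-reader-instance \
  --start-time 2022-06-20T00:00:00Z \
  --end-time 2022-06-20T06:00:00Z \
  --period 300 \
  --statistics Average Maximum
```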
I haven't tested higher ACU settings because I want to limit the maximum cost of RDS. I would still expect a database service to remain stable even when moderate CPU and memory allocations are throttling performance.
Unfortunately I don't have technical support available on that AWS account, so I can't open a proper support ticket.
However, I think Serverless V2 was the wrong choice for our use case from the start, due to the combined challenges of cost, index size, and performance after ACUs scale down.