Aurora Serverless V2 (PostgreSQL) gets stuck


We have a PostgreSQL Aurora Serverless V2 cluster with 2 instances (1 writer and 1 reader). The cluster is configured to scale between 2 and 16 ACUs.

When we populate data into the DB, one or both instances get stuck. PostgreSQL connections time out without any error message. The stuck instance looks healthy in the AWS console, but AWS stops receiving metrics from it. Only an instance reboot fixes the problem temporarily, and even that takes a long time (30-60 min).
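
For reference, this is roughly how we verify the symptom from outside the cluster. It is only a minimal sketch using psycopg2 with placeholder endpoint and credential environment variables (not our real values); against a stuck instance the connect call simply times out with no server-side error.

```python
import os
import sys

import psycopg2  # any libpq-based client behaves the same way


# Placeholder endpoint/credentials read from the environment for this sketch.
try:
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],              # cluster (or reader) endpoint
        dbname=os.environ.get("DB_NAME", "postgres"),
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        connect_timeout=5,                       # fail fast instead of hanging
    )
    conn.close()
    print("connection OK")
except psycopg2.OperationalError as exc:
    # On a stuck instance this is a plain timeout with no further detail.
    print(f"connection failed: {exc}", file=sys.stderr)
    sys.exit(1)
```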

The error logs contain the following line, repeated every 5-10 seconds:

2022-06-20 06:01:06 UTC::@:[31477]:WARNING: worker took too long to start; canceled
2022-06-20 06:01:11 UTC::@:[31477]:WARNING: worker took too long to start; canceled

We checked all instance metrics leading up to the freeze but saw nothing unusual.

The load was quite heavy when we triggered the freezes. In the first run we used 1,000 Lambda functions concurrently to insert data; it worked fine until the writer instance got stuck. Later we used only 100 Lambdas in parallel; this time the writer instance had no issues, but the reader instance got stuck. A sketch of the kind of insert workload each Lambda runs is shown below.
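
Each invocation does batched inserts roughly like this sketch; the table, columns, and environment variable names are placeholders rather than our actual schema.

```python
import json
import os

import psycopg2  # packaged as a Lambda layer in our setup


def handler(event, context):
    """Insert a batch of records passed in the event (placeholder schema)."""
    rows = event.get("rows", [])
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],        # writer endpoint of the cluster
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        connect_timeout=10,
    )
    try:
        # "with conn" commits on success / rolls back on error.
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO events (id, payload) VALUES (%s, %s)",
                [(r["id"], json.dumps(r)) for r in rows],
            )
    finally:
        conn.close()
    return {"inserted": len(rows)}
```

Note that in this pattern each invocation opens its own connection, so 1,000 concurrent Lambdas means on the order of 1,000 simultaneous connections to the writer.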

We chose the new Aurora Serverless V2 as our production database. Right now it feels too unstable for us, and we are considering migrating to a more mature service.

  • Did you get any resolution to this? I am experiencing the exact same problem: no indication of anything going off the rails (memory and CPU are all fine), but a completely unresponsive writer and "WARNING: worker took too long to start; canceled" over and over in the logs. This seems like an AWS Aurora issue that we shouldn't have to pay AWS to support, since we are already paying for the service and there is clearly something wrong with it.

tero
asked 2 years ago · 1,186 views
1 Answer

Can you please post the ACU value the cluster had scaled to just before the freeze? It seems like 16 ACUs is not enough for your workload and the writer is getting pegged at 100% CPU utilization. Have you tried a higher min and max ACU setting, e.g. 4-32, to see what the results are? Since you are in the testing phase, set a higher max ACU and check what ACU level Serverless V2 scales to under your workload; a sketch of both steps follows below.
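
As a sketch of how you could do this with boto3 (the cluster and instance identifiers below are placeholders, so adjust them for your environment): widen the scaling range, rerun the load test, and then read back the ServerlessDatabaseCapacity metric to see how high the workload actually drives the ACUs.

```python
from datetime import datetime, timedelta

import boto3

CLUSTER_ID = "my-aurora-cluster"    # placeholder identifiers
INSTANCE_ID = "my-aurora-writer"

# 1) Widen the Serverless V2 scaling range for the test (e.g. 4-32 ACUs).
rds = boto3.client("rds")
rds.modify_db_cluster(
    DBClusterIdentifier=CLUSTER_ID,
    ServerlessV2ScalingConfiguration={"MinCapacity": 4.0, "MaxCapacity": 32.0},
    ApplyImmediately=True,
)

# 2) Check how many ACUs the instance actually used during the test window.
cloudwatch = boto3.client("cloudwatch")
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ServerlessDatabaseCapacity",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": INSTANCE_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```

If the Maximum sits at your configured ceiling for long stretches, the cluster is capped and CPU will stay pegged at 100%.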

Open a ticket with support as well.

AWS
answered 2 years ago
  • ACU usage before the latest freeze of the reader instance was 10 and CPU usage was 5-10%. Those numbers were quite stable during the 6 hours before the issue. Before that, there was 100% CPU usage and 16 ACU usage for 20 hours.

    I haven't tested higher ACU settings because I want to cap the maximum cost of RDS. I'd still expect the database service to remain stable even when moderate CPU and memory allocations throttle performance.

    Unfortunately I don't have technical support enabled on that AWS account, so I can't open a proper support ticket.

    However, I think Serverless V2 was the wrong choice for our use case in the first place, given the combination of cost, index size, and performance after the ACUs scale down.
