Aurora Serverless v1 (PostgreSQL) scaled down to 0 not scaling back up properly

Update: Turns out it is the call to SSM (Parameter Store) to get DB info that is timing out, so this is not actually related to Aurora Serverless.

Our staging environment is set up to scale down to 0 when not in use. This worked well for a few years, until recently.
I only noticed this last week, but we are now getting inconsistent timeouts when it scales back up. Since we're using PostgreSQL, and the minimum is 2 capacity units, it almost feels like one unit boots up and the other doesn't. I'm not sure whether that's true, or whether there's something else odd about the scaling.

Steps to reproduce:

  • Load a page that requests something from the Aurora cluster that is scaled to 0
  • Wait for 30-60 seconds for the cluster to start
  • Reload the page

Result:

  • When I reload the page that hits the DB, it sometimes works, and sometimes doesn't. When it doesn't work, it times out (exactly as it did the first time). There's no pattern, and I haven't yet found a way (even after a few minutes of trying) to make it consistently stay up.

This makes CI (especially migrations) and smoke testing on staging painful: I have to re-run the action several times before it succeeds (a run eventually happens to hit the DB while it reports being up). This is not an issue on our production instance, which doesn't scale down to 0.

Both instances are using:

  • Aurora Serverless v1
  • Postgres 11.18

Staging Scaling config:

  • Autoscaling timeout: 5 minutes
  • Pause compute capacity after consecutive minutes of inactivity: 10 minutes
  • Minimum Aurora capacity units: 2 capacity units
  • Maximum Aurora capacity units: 2 capacity units
  • Force scaling the capacity to the specified values when the timeout is reached: Not enabled
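For reference, the staging settings listed above could be expressed as the `ScalingConfiguration` argument of the RDS `modify_db_cluster` API. This is only a sketch: the mapping of "Force scaling ... Not enabled" to `RollbackCapacityChange` is an assumption based on the API's two timeout actions, and `apply_scaling` with its `cluster_id` argument is a hypothetical helper, not something from the original setup.

```python
def staging_scaling_configuration():
    """Return the staging scaling settings as an RDS ScalingConfiguration dict."""
    return {
        "MinCapacity": 2,                  # minimum Aurora capacity units
        "MaxCapacity": 2,                  # maximum Aurora capacity units
        "AutoPause": True,                 # allow scaling down to 0 when idle
        "SecondsUntilAutoPause": 10 * 60,  # pause after 10 minutes of inactivity
        "SecondsBeforeTimeout": 5 * 60,    # autoscaling timeout: 5 minutes
        # Assumption: "force scaling" disabled corresponds to rolling back
        # the capacity change when the timeout is reached.
        "TimeoutAction": "RollbackCapacityChange",
    }


def apply_scaling(cluster_id):
    """Hypothetical helper: apply the settings to an existing cluster."""
    import boto3  # imported lazily so the dict can be inspected without boto3

    rds = boto3.client("rds")
    return rds.modify_db_cluster(
        DBClusterIdentifier=cluster_id,
        ScalingConfiguration=staging_scaling_configuration(),
    )
```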
asked 9 months ago, 356 views
1 Answer
Accepted Answer

Hi,

You stated "minimum is 2 nodes" in the question. Please note that "node" is not an applicable term for Aurora Serverless v1. Could you confirm that you meant "2 capacity units"?

Are you able to reproduce the problem you described using the built-in query editor on the RDS console?

AWS
Aslan G
answered 9 months ago
  • Indeed, I meant 2 capacity units; I will update the question to reflect this. And that's a good thing to test: I cannot reproduce this in the query editor, but it is reproducible with a minimal Lambda (behind APIGW) that simply connects to the database and queries a single item. The 30s APIGW timeout is hit inconsistently, but very often. The same code does not reproduce the problem in production, and the only infrastructural difference is the database configuration.

  • After a few days of investigation, it looks like the call that's timing out is the one to SSM that gets database information from Parameter Store. Since it has nothing to do with Aurora, I will accept this as the answer and continue searching, and maybe open an issue related to SSM.
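
One way to tell an SSM timeout apart from a database timeout is to time each step of the handler separately. The following is a minimal sketch, not the code from the thread: `timed`, `handle`, and `fetch_db_config_from_ssm` are hypothetical names, the parameter name `/staging/db/url` is made up for illustration, and the 3-second timeouts are arbitrary values chosen to fail well before the 30s APIGW limit.

```python
import time


def timed(label, fn, timings):
    """Run fn(), recording its wall-clock duration under label even on failure."""
    start = time.perf_counter()
    try:
        return fn()
    finally:
        timings[label] = time.perf_counter() - start


def handle(fetch_db_config, run_query):
    """Time the config lookup and the query separately and return both timings."""
    timings = {}
    config = timed("ssm_get_parameter", fetch_db_config, timings)
    result = timed("db_query", lambda: run_query(config), timings)
    return result, timings


def fetch_db_config_from_ssm(name="/staging/db/url"):
    """Hypothetical lookup; the parameter name is an assumption."""
    import boto3  # imported lazily so the timing helpers work without boto3
    from botocore.config import Config

    # Short timeouts make a stuck SSM call fail fast instead of hanging
    # until the API Gateway limit is reached.
    ssm = boto3.client(
        "ssm",
        config=Config(connect_timeout=3, read_timeout=3, retries={"max_attempts": 2}),
    )
    return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
```

Logging the returned `timings` dict on each invocation would show whether the slow step is the Parameter Store lookup or the database query.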
