Aurora PostgreSQL 16.3 running out of local disk space


We have two Aurora PostgreSQL databases that were upgraded from version 16.2 to version 16.3 on August 9th. Today they both ran out of disk space on the local storage volume, whereas prior to the upgrade the "FreeLocalStorage" metric was stable. Both databases have a single db.t4g.medium instance.

[Image: FreeLocalStorage metric graph]

There hasn't been any change to the databases' access patterns in the last two weeks that could explain this, and both databases have barely any load, hence the small instance sizes. They are both using the default "default.aurora-postgresql16" database instance and cluster parameter groups.

I have tried following https://repost.aws/knowledge-center/postgresql-aurora-storage-issue, which covers running out of local disk space. While one of the databases shows use of temporary files:

postgres=> SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_file_size FROM pg_stat_database ORDER BY temp_bytes DESC;
   datname    | temp_files | temp_file_size
--------------+------------+----------------
 my_database  |      38475 | 180 GB
 rdsadmin     |          4 | 27 kB
              |          0 | 0 bytes
 template0    |          0 | 0 bytes
 template1    |          0 | 0 bytes
 postgres     |          0 | 0 bytes
(6 rows)

the other one does not:

postgres=> SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_file_size FROM pg_stat_database ORDER BY temp_bytes DESC;
  datname    | temp_files | temp_file_size
-------------+------------+----------------
 rdsadmin    |          4 | 27 kB
 my_database |          2 | 2609 bytes
             |          0 | 0 bytes
 template0   |          0 | 0 bytes
 template1   |          0 | 0 bytes
 postgres    |          0 | 0 bytes
(6 rows)

Querying for temporary tables doesn't return any results on either database.
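For reference, the check was along these lines; this is not necessarily the exact query from the knowledge-center article, just a catalog query that lists relations held in temporary namespaces together with their on-disk sizes:

SELECT n.nspname AS schema_name,
       c.relname AS relation_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't'   -- 't' marks temporary relations
ORDER BY pg_relation_size(c.oid) DESC;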

I have tried resetting the PostgreSQL statistics with SELECT pg_stat_reset();, which had no impact on the storage usage; neither did restarting one of the database instances.

Does anybody else see this, or does anyone have an explanation for what has caused this issue? I'm inclined to think it's a buggy Aurora PostgreSQL release, as another non-Aurora RDS PostgreSQL 16.3 database hasn't shown any issues like this. Unfortunately, it's not possible to downgrade the PostgreSQL version, so if it is a bug I'll have to re-create the databases from a dump or hope a bugfix release is imminent.

  • I tried creating new instances in the affected clusters and removing the existing instances. That at least cleared the local storage, but a day later I can see the available local storage space is still dropping at a steady rate, so this is only a workaround that will keep things running for 10-12 days before I have to do it again.

    I'm hoping AWS can make a fix for this soon.

  • We are also experiencing this issue after upgrading to 16.3. All databases, regardless of instance type and workload, are losing local storage at the same pace. AWS Support is currently looking into it with their backend team.

  • We are also seeing this behaviour on all our 16.3 clusters. They were previously on 15.x and 16.1 and did not display this problem. We have cleared logs and WAL settings have not changed. It is occurring on all our burstable instances at a similar rate. It appears at the moment that our newest cluster on 16.3 on r6g instances does not have the same problem.

  • Got a response from our AWS Technical Account Manager that they have identified an issue with Aurora PostgreSQL versions 16.3, 15.7, 14.12, 13.15, and 12.19 that causes depletion of local storage. The issue is caused by a service log file that is not correctly garbage collected. They are in the process of deploying a fix, which will be applied during the standard maintenance window. The timing of the fix depends on the region and the maintenance window for individual clusters. They anticipate completing this process for all regions and clusters by September 13, 2024.

  • If you check your AWS Health Dashboard, you may see an open issue concerning an "RDS operation issue", which contains some information about the problem and details their workaround. However, our FreeLocalStorage metric looks very similar to yours and is making our database unusable. Restarting the instances within the cluster seems to resolve it temporarily, but that is not a suitable fix. We have reached out to support to see what they can advise, but if anyone else determines a fix or workaround, please share! Our only other option would be to painfully downgrade, as 9/13 is too far away.

2 Answers

After opening a few tickets with AWS Support, we haven't received any substantial guidance on how to fix this issue. Unfortunately, at this time, the only thing we have found to be helpful is to reboot the entire cluster. Once all of your instances have been stopped, you can start the cluster back up and the instances will reclaim their provisioned FreeLocalStorage. They will continue to lose FreeLocalStorage until they are rebooted again or AWS fixes the issue, but it seems to be the only option once your instances have completely run out of FreeLocalStorage. Unfortunately, this doesn't seem possible instance by instance, since the containers, which is where the FreeLocalStorage issue lives, will continue to run, so minimizing downtime while rebooting seems unlikely.

After completing this, despite AWS's initiative to provide additional garbage collection and their claim that no action is required by their customers, you can see in the image below that 3 of our instances remained at 0 FreeLocalStorage for many hours until the cluster was rebooted manually.

[Image: FreeLocalStorage graph for the affected instances]

You can also see the downward slope after the reboots at 16:45, indicating that FreeLocalStorage will run out again. Ignore the red line, since that instance was rebooted manually a second time to install an OS update to test whether it fixed anything (it didn't).

Edit: another strategy, to minimize downtime, would be to add replicas to replace the active ones and perform a failover, which should promote one of the newly spun-up replicas to writer (failover should prioritize them). Then recycle all of your older replicas that have no FreeLocalStorage left by deleting them from the cluster. This, of course, only works if you have architected your application to support failovers.
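As a rough sanity check after the failover, assuming you can connect to each instance endpoint, pg_is_in_recovery() tells you whether you have reached the writer or a reader:

-- Returns false on the writer instance and true on a read replica.
SELECT pg_is_in_recovery();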

Zack
answered 10 days ago
  • For me, it worked to replace the instance that had run out of local storage by adding a new instance to the cluster and then removing the old one. My clusters are small single-instance clusters, though; I don't know if that makes any difference.

  • Interesting! I am skeptical. The result of spinning up a new instance is likely the same as stopping and starting the instances: both will come back with their full provisioned FreeLocalStorage. However, I imagine the new instance you created, if it's on the same engine version, is experiencing the same rate of FreeLocalStorage decay we have noticed. If you look at that instance's FreeLocalStorage in the AWS metrics and check its rate of change, I hypothesize that it will be negative, indicating that it is decaying over time. Hopefully AWS fixes the issue before it affects you, if at all!

  • What I suggested was only a workaround for clearing up the local storage, which at least should be less disruptive than stopping the cluster entirely and starting it back up again. It doesn't fix the underlying issue, but at least AWS are working on rolling out a fix for that.


If the amount of working memory needed for sort or index-creation operations exceeds the amount allocated by the work_mem parameter, Aurora PostgreSQL writes the excess data to temporary disk files. When it writes the data, Aurora PostgreSQL uses the same storage space that it uses for storing error and message logs, that is, local storage. Each instance in your Aurora PostgreSQL DB cluster has an amount of local storage available. The amount of storage is based on its DB instance class. To increase the amount of local storage, you need to modify the instance to use a larger DB instance class.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.BestPractices.TroubleshootingStorage.html
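If temporary-file spill really were the cause here, a first step would be to check the current setting and experiment per session. This is only a sketch: the 64MB value is illustrative, and a permanent change would go in the DB parameter group.

-- Current per-operation memory limit for sorts and hash operations.
SHOW work_mem;

-- Raise it for the current session only, as an experiment.
SET work_mem = '64MB';

-- Setting log_temp_files in the parameter group additionally logs any query
-- that spills temporary files larger than the given threshold (in kB).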

EXPERT
answered 18 days ago
  • I don't think that applies here, because the output from pg_stat_database shows one of the databases only having used temporary files twice in the ~2 weeks since the upgrade for a total of 2,609 bytes, yet it is still running out of local disk space.

    If I have to upgrade the instance size to overcome this issue, I guess it effectively means Aurora PostgreSQL no longer works with db.t4g.medium instances as of version 16.3, which doesn't seem ideal.

    I'm tempted to start up a new, clean database to see if it exhibits the same problem, but it would take some time before I could tell whether it is running out of disk space.

  • We are also experiencing this - all 16.3 instances, regardless of traffic, replication, or anything we can figure would affect it.
