After opening a few tickets with AWS Support, we haven't received any substantial guidance on how to fix this issue. Unfortunately, the only thing we found to be helpful was to reboot the entire cluster. Once all of your instances have been stopped, you can start the cluster back up and the instances will reclaim their provisioned FreeLocalStorage. They will continue to lose FreeLocalStorage until they are rebooted again or AWS ships a fix, but this seems to be the only option once your instances have completely run out of FreeLocalStorage. It doesn't seem possible to do this instance by instance, since the containers (which is where the FreeLocalStorage issue lives) keep running, so minimizing downtime while rebooting seems unlikely.
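If you end up scripting the stop/start cycle, something along these lines with boto3 should do it. This is only a rough sketch: the cluster identifier is a placeholder and the polling has no real error handling.

```python
import time

import boto3

rds = boto3.client("rds")
CLUSTER_ID = "my-aurora-cluster"  # placeholder, use your cluster identifier


def wait_for_cluster_status(target):
    # Poll until the cluster reports the status we are waiting for.
    while True:
        cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]
        if cluster["Status"] == target:
            return
        time.sleep(30)


# Stop every instance in the cluster, then start it back up so each
# instance comes back with its full provisioned FreeLocalStorage.
rds.stop_db_cluster(DBClusterIdentifier=CLUSTER_ID)
wait_for_cluster_status("stopped")

rds.start_db_cluster(DBClusterIdentifier=CLUSTER_ID)
wait_for_cluster_status("available")
```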
Even after completing this, and despite AWS's initiative to provide additional garbage collection and their claim that no action is required by their customers, you can see in this image that 3 of our instances remained at 0 FreeLocalStorage for many hours until the cluster was rebooted manually.
You can also see the downward slope after the reboots at 16:45, indicating that FreeLocalStorage will run out again. Ignore the red line, since that instance was rebooted manually a second time to install an OS update and test whether that fixed anything (it didn't).
Edit: another strategy to minimize downtime would be to add new replicas alongside the active ones and perform a failover, so that the writer role moves to one of the newly spun-up replicas (failover should prioritize them). Then recycle all of your older replicas, which have no FreeLocalStorage left, by deleting them from the cluster. This, of course, only works if you have architected your application to tolerate failovers.
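Roughly, that failover-and-recycle approach could look like this with boto3. The cluster, instance identifiers, and instance class below are placeholders, and I'm assuming you let each step finish before moving on.

```python
import boto3

rds = boto3.client("rds")
CLUSTER_ID = "my-aurora-cluster"                            # placeholder
NEW_INSTANCE_ID = "my-aurora-replacement"                   # placeholder
OLD_INSTANCE_IDS = ["my-aurora-old-1", "my-aurora-old-2"]   # placeholders

# 1. Add a fresh replica to the cluster (it starts with full FreeLocalStorage).
rds.create_db_instance(
    DBInstanceIdentifier=NEW_INSTANCE_ID,
    DBClusterIdentifier=CLUSTER_ID,
    Engine="aurora-postgresql",
    DBInstanceClass="db.r6g.large",  # match your existing instance class
    PromotionTier=0,                 # make it the preferred failover target
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=NEW_INSTANCE_ID)

# 2. Fail over so the new replica becomes the writer.
rds.failover_db_cluster(
    DBClusterIdentifier=CLUSTER_ID,
    TargetDBInstanceIdentifier=NEW_INSTANCE_ID,
)

# 3. Once the failover has completed, recycle the old instances that have
#    exhausted their local storage.
for old_id in OLD_INSTANCE_IDS:
    rds.delete_db_instance(DBInstanceIdentifier=old_id)
```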
For me, it worked to replace the instance which had run out of local storage by adding a new instance in the cluster, and then removing the old one. My clusters are small single-instance clusters though. I don't know if that makes any difference.
Interesting! I am skeptical. The result of spinning up a new instance is likely the same result as stopping and starting the instances. They will both have their full provision FreeLocalStorage. However, I imagine the new instance you created, if using the same engine version, is experiencing the same rate of decay of FreeLocalStorage we have noticed. If you look at the FreeLocalStorage within AWS Metric and look at the rate of the instance's FreeLocalStorage, I hypothesis that it will be negative, indicating that it is decaying over time. Hopefully AWS fixes the issue before that issue affects you, if at all!
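To check the decay rate yourself, you can pull the FreeLocalStorage metric from CloudWatch. A minimal sketch with boto3, where the instance identifier is a placeholder:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
INSTANCE_ID = "my-aurora-instance"  # placeholder

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="FreeLocalStorage",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": INSTANCE_ID}],
    StartTime=now - timedelta(days=2),
    EndTime=now,
    Period=3600,               # hourly averages
    Statistics=["Average"],
)

# Sort the datapoints by time and eyeball the slope: a steadily negative
# trend means the instance is leaking local storage.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), int(point["Average"]))
```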
What I suggested was only a workaround for clearing up the local storage which at least should be less disruptive than stopping the cluster entirely and starting it back up again. It doesn't fix the underlying issue, but at least AWS are working on rolling out a fix for that.
If the amount of working memory needed for sort or index-creation operations exceeds the amount allocated by the work_mem parameter, Aurora PostgreSQL writes the excess data to temporary disk files. When it writes the data, Aurora PostgreSQL uses the same storage space that it uses for storing error and message logs, that is, local storage. Each instance in your Aurora PostgreSQL DB cluster has an amount of local storage available. The amount of storage is based on its DB instance class. To increase the amount of local storage, you need to modify the instance to use a larger DB instance class.
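To rule temp-file spill in or out as the cause, you can check pg_stat_database yourself. A quick sketch with psycopg2, where the connection details are placeholders:

```python
import psycopg2

# Connection details are placeholders; point this at your Aurora endpoint.
conn = psycopg2.connect(
    host="my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    dbname="postgres",
    user="postgres",
    password="...",
)

with conn, conn.cursor() as cur:
    # work_mem controls when sorts and index builds spill to local storage.
    cur.execute("SHOW work_mem")
    print("work_mem:", cur.fetchone()[0])

    # temp_files / temp_bytes show how much has actually spilled per database.
    cur.execute(
        "SELECT datname, temp_files, temp_bytes "
        "FROM pg_stat_database WHERE datname IS NOT NULL "
        "ORDER BY temp_bytes DESC"
    )
    for datname, temp_files, temp_bytes in cur.fetchall():
        print(datname, temp_files, temp_bytes)
```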
I don't think that applies here, because the output from pg_stat_database shows one of the databases only having used temporary files twice in the ~2 weeks since the upgrade, for a total of 2,609 bytes, yet it is still running out of local disk space. If I have to upgrade the instance size to overcome this issue, it effectively means Aurora PostgreSQL no longer works with db.t4g.medium instances since version 16.3, which doesn't seem ideal.
I'm tempted to try to start up a new, clean database to see if it exhibits the same problem, but it would take some time before I could see if it is running out of disk space.
We are also experiencing this - all 16.3 instances, regardless of traffic, replication, or anything we can figure would affect it.
I tried creating new instances in the affected clusters and removing the existing instances. That at least cleared the local storage, but a day later I can see the available local storage is still dropping at a steady rate, so this is only a workaround that will keep things running for 10-12 days before I have to do it again.
I'm hoping AWS can make a fix for this soon.
We are also experiencing this issue after upgrading to 16.3. All databases regardless of instance type and workload experience the same pace of depleting local storage. AWS Support is currently looking into it with their backend team.
We are also seeing this behaviour on all our 16.3 clusters. They were previously on 15.x and 16.1 and did not display this problem. We have cleared logs and WAL settings have not changed. It is occurring on all our burstable instances at a similar rate. It appears at the moment that our newest cluster on 16.3 on r6g instances does not have the same problem.
Got a response from our AWS Technical Account Manager that they have identified an issue with Aurora PostgreSQL versions 16.3, 15.7, 14.12, 13.15, and 12.19 that causes depletion of local storage. The issue is caused by a service log file that is not correctly garbage collected. They are in the process of deploying a fix to the issue which will be applied in the standard maintenance window. The timing of the fix is dependent on region and the maintenance window for individual clusters. They anticipate completing this process for all regions and clusters by September 13, 2024.
If you check your AWS Health Dashboard, you may see an open issue concerning "RDS operation issue" which contains some information about the problem and describes their workaround. However, our FreeLocalStorage metrics look very similar to yours and the issue is making our database unusable. Restarting the instances within the cluster seems to resolve it temporarily, but it is not a suitable fix. We have reached out to support to see what they can advise, but if anyone else determines a fix or workaround, please share! Our only other option would be to painfully downgrade, as 9/13 is too far away.