PostgreSQL Read Replica Shows High Read IO Every 15 Minutes

3

We have an RDS PostgreSQL read replica that spikes to 3000 IOPS for 1-2 minutes exactly every 15 minutes. It started overnight yesterday with no prior warning. We can't find any queries or jobs touching either the primary database or the read replica that would cause this. Manual backups happen once nightly. Replication shows no current lag. The spikes eventually run us out of EBS burst credits. Rebuilding the read replica results in the same condition.

We're at a loss as to what to look at to determine the cause. pg_stat_activity doesn't show any queries running at the time of the spikes.
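
To put rough numbers on the burst credit drain described above, here is a minimal sketch. It assumes the replica uses gp2 storage (the question only says "EBS burst credits"), with the documented gp2 model of a 5.4 million credit bucket refilled at 3 IOPS per GiB with a floor of 100 IOPS. The 20 GiB volume size and the 90-second spike length are hypothetical values chosen for illustration, and normal workload I/O is ignored.

```python
# Rough gp2 burst-bucket model. All constants assume gp2 storage,
# which the original question does not confirm.
GP2_BUCKET_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_IOPS_PER_GIB = 3            # baseline refill rate per GiB
GP2_MIN_BASELINE = 100          # baseline floor in IOPS


def gp2_baseline_iops(volume_gib):
    """Baseline IOPS (also the credit refill rate) for a gp2 volume."""
    return max(GP2_MIN_BASELINE, GP2_IOPS_PER_GIB * volume_gib)


def hours_until_empty(volume_gib, spike_iops=3000, spike_seconds=90,
                      period_seconds=900):
    """Hours for a full bucket to drain under periodic spikes alone.

    Returns None when the refill over one period covers the spike,
    meaning the bucket never drains in this simplified model.
    """
    spent = spike_iops * spike_seconds                        # credits used per spike
    earned = gp2_baseline_iops(volume_gib) * period_seconds   # credits refilled per period
    net_loss = spent - earned
    if net_loss <= 0:
        return None
    cycles = GP2_BUCKET_CREDITS / net_loss
    return cycles * period_seconds / 3600


if __name__ == "__main__":
    # Hypothetical 20 GiB volume, 3000 IOPS for 90 s every 15 minutes.
    print(hours_until_empty(20))
```

Under these assumptions a 20 GiB volume loses 180,000 credits per 15-minute cycle and empties a full bucket in about 7.5 hours. Any regular workload on top of the spikes would shorten that.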

4 Answers
1

I finally got some news from AWS support. They recently applied a patch to RDS instances that seems to be causing the issue, as the problem started after they applied the patch. They also say:

"It appears that this is a known issue that is currently occurring with RDS PostgreSQL for some instance classes. Unfortunately I do not have a specific list of impacted instances as it seems to be an internal issue and I cannot provide you with an ETA for a fix, but I can confirm that the internal team is actively working on this issue and will deploy a fix as soon as possible."

JuanM
answered 2 years ago
  • That's absolutely wonderful news!

  • Thanks for doing this and posting. We're hanging on by a thread here in a couple environments.

  • JuanM, we seem to be seeing relief on our end. Wondered if you were seeing the same?

1

It seems that someone else is having the same issue. We're reluctant to pay AWS Support fees (which are expensive) for what seems to be an AWS issue that falls on AWS's side of the Shared Responsibility model.

ssmith
answered 2 years ago
0

We have been experiencing the same issue on all of our RDS PostgreSQL databases (6 instances) for the past 3 days.

We have spikes of ReadIOPS every 15 minutes. Other metrics such as ReadLatency and DiskQueueDepth are affected as well. CPU usage also appears to be affected during the ReadIOPS spikes.
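
One way to confirm that the spikes really recur on a fixed interval is to measure the gaps between spike starts in exported metric data. The sketch below is only an illustration: `spike_intervals`, the threshold, and the sample series are all hypothetical, and the data is synthetic rather than taken from any instance in this thread.

```python
def spike_intervals(samples, threshold):
    """Return the gaps, in minutes, between the starts of successive spikes.

    samples: list of (minute_index, read_iops) pairs, sorted by time,
             one datapoint per minute.
    threshold: IOPS value at or above which a datapoint counts as a spike.
    """
    starts = []
    in_spike = False
    for minute, iops in samples:
        above = iops >= threshold
        if above and not in_spike:
            starts.append(minute)  # first datapoint of a new spike
        in_spike = above
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Synthetic one-hour series: 2-minute spikes to 3000 IOPS every 15 minutes,
    # with a made-up 50 IOPS background.
    synthetic = [(m, 3000 if m % 15 in (0, 1) else 50) for m in range(60)]
    print(spike_intervals(synthetic, threshold=1000))
```

If every gap comes out identical, a timer-driven process is a more likely cause than query load, which tends to vary.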

I tried rebooting the instance into the secondary zone (Multi-AZ), but it didn't solve the problem.

I had to increase the storage just to make BurstBalance recover faster and avoid an outage from exhausting the credits.
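
Increasing storage helps because, on gp2 volumes, the credit refill rate scales with volume size. The sketch below estimates the size at which the refill over one 15-minute period covers a single spike. It assumes gp2 storage at 3 IOPS per GiB (not stated in the thread), ignores all other I/O, and uses a hypothetical helper name, `break_even_gp2_gib`.

```python
import math


def break_even_gp2_gib(spike_iops=3000, spike_seconds=90,
                       period_seconds=900, iops_per_gib=3):
    """Smallest gp2 size (GiB) whose baseline refill covers the spikes.

    Assumes gp2 refills at iops_per_gib per GiB and that the spikes are
    the only I/O. Results below about 34 GiB are moot, because gp2 has
    a 100 IOPS baseline floor regardless of size.
    """
    # Average IOPS the baseline must sustain to repay one spike per period.
    needed_baseline = spike_iops * spike_seconds / period_seconds
    return math.ceil(needed_baseline / iops_per_gib)


if __name__ == "__main__":
    for seconds in (60, 90, 120):
        print(seconds, break_even_gp2_gib(spike_seconds=seconds))
```

For 3000 IOPS spikes lasting 60, 90, and 120 seconds, this gives roughly 67, 100, and 134 GiB respectively. Real workloads need more headroom than that.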

I reported the issue to AWS support but I still don't have an answer.

JuanM
answered 2 years ago
  • We ended up finding that if we moved our database from a db.t3.medium to a db.t3.large the impact of whatever was going on was reduced. Additionally, moving it to a db.m6g.large eliminated the effects completely. The effects came back when we moved the instance back to a db.t3.medium, which we feel is best suited for the read-replica (and had been serving us just fine for years).

  • Can you check yours? I think mine stopped about 5 hours ago. Never mind, it came back after I changed my instance size back.

  • @fikrimi, we still have the issue here. @ssmith, my guess about why moving to db.t3.large reduced the impact is that you doubled the memory size (from 4 GiB to 8 GiB), so the workload is more likely to fit completely in memory now. Whatever was reading data from disk every 15 minutes is now reading it from memory, without impacting ReadIOPS. But I'm pretty sure that if your instance upgrade only reduced the problem, the issue is still there.

  • Yep, mine stopped showing the reads when I used a t3.xlarge while trying to upgrade to Postgres 11. I left it overnight and didn't see the read spike anymore, so I thought it had been fixed. Then I scaled it down to its original size (t3.small) and it came back. At least on medium it doesn't chug through all the burst .. sigh ..

  • @JuanM I'd agree that the memory increase only hid the situation. I just want to know why it happened all of a sudden. No ramp-up in the previous days, just one day we woke up and the load was there.

0

This needs to be looked at by AWS support for your instance and its underlying EBS disks. From what you describe, it may well be on the AWS end if it's happening at an exact 15-minute interval without ANY of your workload involved. If you have not already, please open a case with AWS support. They should be able to troubleshoot this for you and diagnose what may be happening on the EBS disks used by your RDS read replica.

Bakul_R
answered 2 years ago
