RDS postgres hot_standby_feedback not preventing recovery conflict

0

I have an RDS (non-Aurora) postgres primary (multi-AZ) with associated readonly replica. We were seeing queries canceled on the readonly due to recovery conflicts, so I changed hot_standby_feedback=1 in the parameter group for the readonly replica. This resolved some conflicts, meaning we are seeing fewer of them, but it hasn't resolved them all. Today I saw this in the readonly replica's log:

ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed.

According to this document I found in the knowledge center (https://aws.amazon.com/premiumsupport/knowledge-center/rds-postgresql-error-conflict-recovery/), the solution for this is to set hot_standby_feedback=1 (true). This is also backed by postgres documentation, which states this:

"Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. The first option is to set the parameter hot_standby_feedback, which prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server, and you are still getting the benefit of off-loading execution onto the standby. If standby servers connect and disconnect frequently, you might want to make adjustments to handle the period when hot_standby_feedback feedback is not being provided. For example, consider increasing max_standby_archive_delay so that queries are not rapidly canceled by conflicts in WAL archive files during disconnected periods. You should also consider increasing max_standby_streaming_delay to avoid rapid cancellations by newly-arrived streaming WAL entries after reconnection." (see https://www.postgresql.org/docs/12/hot-standby.html#HOT-STANDBY-CONFLICT)

I thought that by setting hot_standby_feedback=1 that we should not encounter anymore conflicts due to "User query might have needed to see row versions that must be removed". Why are we still seeing them?

Has anyone else encountered this problem? Any thoughts on how to correct it?

2 Answers
0

You contacted us because you observed the following error in the replica of your database even after hot_standby_feedback set to1 and you wanted to know how to fix this.

ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed.

Cause of the Issue

Usually the cause of this error comes when some queries running in Read replica trying to fetch the rows, and these rows are already updated/deleted in primary database OR Vacuuming in the primary already cleaned up the old version of these rows. In this scenario since the old version of rows was already cleaned up then the same goes to Read replica through WAL replication, So finally replica faces the conflict with recovery.

Below two settings control the behavior of a standby server that is to receive replication data.

A) max_standby_archive_delay (integer): Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data. max_standby_archive_delay applies when WAL data is being read from WAL archive When Hot Standby is active, this parameter determines how long the standby server should wait before canceling standby queries that conflict with about-to-be-applied WAL entries. max_standby_archive_delay applies when WAL data is being read from WAL archive (and is therefore not current). The default is 30 seconds.

https://postgresqlco.nf/doc/en/param/max_standby_archive_delay/

B) max_standby_streaming_delay (integer): Sets the maximum delay before canceling queries when a hot standby server is processing streamed WAL data. max_standby_streaming_delay applies when WAL data is being received via streaming replication. When Hot Standby is active, this parameter determines how long the standby server should wait before canceling standby queries that conflict with about-to-be-applied WAL entries. max_standby_streaming_delay applies when WAL data is being received via streaming replication. In case of Postgres 11 version it is 14 seconds and its modifiable to a maximum value of 30 sec only.

https://postgresqlco.nf/doc/en/param/max_standby_streaming_delay/

Suggestions

Unfortunately, as we don't have access to your data or databases due to AWS security and privacy policies, I would like to provide you with the following recommendations to resolve this issue and you may also consider to engage a DBA/Application team to investigate further.

If there are continuous long running queries present in reader instance, this workload pattern eventually causes the reader transactions to terminate due to conflict with recovery. Please make sure that the activities that require a highly restricted level of locking run when there is less activity on the reader instance. An example for this would be DROP and ALTER statements executing on the writer instance. To maintain data consistency, Postgres could cancel the transactions running against these relations on your reader instance. Please do not DROP tables or tablespaces on your writer while there are long running queries or open cursors running against those tables on your reader.

If you have long running queries on your reader instance, add an application side delay on queries running on the writer so they don't coincide with your reader queries. For example, if a long running transaction T1 is already running on your reader, running a transaction T2 on the writer at the same time, will lead to the transactions on the reader getting cancelled since they would result in dirty or phantom reads based on your isolation level. In such a case, your transaction T2 on the writer should be delayed until the transaction T1 on the reader completes. This will ensure consistent reads and that there are no lock conflicts. Set the following parameters to add the delays or test different values to get the optimal setting for your environment. As always, it is important to try these settings in a test environment before promoting it to production.

-max_standby_archive_delay=-1 <----(A value of -1 allows the standby to wait forever for conflicting queries to complete.This can introduce application performance issues. Try different values, max value is 2147483647milliseconds)

-max_standby_streaming_delay=-1 <----((A value of -1 allows the standby to wait forever for conflicting queries to complete.This can introduce application performance issues. Try different values, max value is 2147483647milliseconds)

On a database that has many active transactions, the likelihood of hitting the error increases. Tune your application to operate in batches if possible. If you tune your application workload to operate in batches, tables/resources in your writer are not held with locks for a long time while there are queries running on the same tables/resources on your replica.

I would like to suggest you to refer below AWS documents for reference. [+] https://aws.amazon.com/blogs/database/best-practices-for-amazon-rds-for-postgresql-cross-region-read-replicas/

answered a year ago
0

If we implement the max_* parameters as described above, should we leave hot_standby_feedback=1 or set that back to hot_standby_feedback=0? I'm clear not on what hot_standby_feedback is accomplishing.

KrisO
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions