How do I troubleshoot the error "canceling statement due to conflict with recovery" when querying the read replica for my RDS for PostgreSQL DB instance?

5 minute read
0

I've set up a read replica for my Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance. I get the error "canceling statement due to conflict with recovery" when I query the read replica.

Short description

This error might occur due to the lack of visibility from the primary instance over the activity that's happening on the read replica. The conflict with recovery occurs when WAL information can't be applied on the read replica because the changes might obstruct an activity that's happening on the read replica.

For example, suppose that you run a DROP statement on the primary instance when a long running SELECT statement is running on the read replica on the table that's dropped on the primary instance. Then, the read replica has two options:

  • Wait for the SELECT statement to be finished before applying the WAL record. In this case, the replication lag increases.
  • Apply the WAL record, and then cancel the SELECT statement. In this case, you get the error "canceling statement due to conflict with recovery".

The read replica resolves these replication conflicts based on the value of the parameters max_standby_streaming_delay and max_standby_archive_delay. The max_standby_streaming_delay parameter determines how long the read replica must wait before canceling standby queries that conflict with WAL entries that are about to be applied. If the conflicting statement is still running after this period, then PostgreSQL cancels the statement and issues the following error message:

ERROR: canceling statement due to conflict with recovery

You get this error typically due to long running queries on the read replica.

In the preceding example with the DROP statement, the DROP request is stored on the WAL file for applying later on the read replica for consistency. Suppose that a SELECT statement is already running on the read replica that tries to retrieve data from the dropped object with a runtime that's more than the value in max_standby_streaming_delay. Then, the SELECT statement is canceled so that the DROP statement can be applied.

Session 1 (Read replica) Run a SELECT statement on example_table:

postgres=> SELECT * from example_table;

Session 2 (Primary) Run a DROP statement on example_table:

postgres=> DROP TABLE example_table;

You get the following error:

postgres@postgres:[27544]:ERROR:  canceling statement due to conflict with recovery
postgres@postgres:[27544]:DETAIL:  User was holding a relation lock for too long.
postgres@postgres:[27544]:STATEMENT:  select * from example_table;

Also, query conflict might happen when a transaction on the read replica is reading tuples that are set for deletion on the primary instance. The deletion of tuples followed by vacuuming on the primary instance causes a conflict with the SELECT query that's still running on the replica. In this case, the SELECT query on the replica is terminated with the following error message:

ERROR:  canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

Resolution

When a read replica encounters a conflict, and you get the error "canceling statement due to conflict with recovery" on the error log, you can set certain custom parameters based on the error message to reduce the impact of the conflict. Note that the custom parameters must be set on the read replica.

You get the error "canceling statement due to conflict with recovery" with "DETAIL: User was holding a relation lock for too long"

max_standby_streaming_delay/max_standby_archive_delay: You can use these parameters to allow more time before canceling standby statements that conflict with the about-to-be-applied WAL entries. These values represent the total amount of time that's allowed for applying WAL data after the data is received from the primary instance. These parameters are specified depending on from where the WAL data is read. If WAL data is read from streaming replication, then use the max_standby_streaming_delay parameter. If WAL data is read from the archive location in Amazon Simple Storage Service (Amazon S3), then use the max_standby_archive_delay parameter.

Keep the following in mind when setting these parameters:

  • If you set the values of these parameters to -1, the replica instance is allowed to wait forever for conflicting queries to complete, increasing the replication lag.
  • If you set the values of these parameters to 0, then the conflicting queries are canceled, and the WAL entries are applied on the replica instance.
  • The default value for these parameters is set to 30 seconds.
  • If you don't specify the unit when setting these parameters, millisecond is considered as the unit.

Tune the values of these parameters to balance query cancellation or replication lag based on your use case.

Note: If you are increasing max_standby_archive_delay to avoid canceling queries that conflict with reading WAL archive entries, then consider increasing max_standby_streaming_delay as well to avoid cancelations linked to conflict with streaming WAL entries.

You get the error "canceling statement due to conflict with recover" with "DETAIL: User query might have needed to see row versions that must be removed"

hot_standby_feedback: If you activate this parameter, feedback messages are sent to the primary instance from the read replica with information of the oldest active transaction. Therefore, the primary instance doesn't remove records that the transaction might need.

When you activate this parameter on the read replica, long-running queries on the read replica might lead to table bloat on the primary instance. This is because vacuum operations don't remove the dead tuples that might be required by queries running on the read replica. This parameter is turned off by default. Therefore, exercise caution when activating this parameter.

You can also inspect the pg_stat_database_conflicts view on the read replica for statistics about statements that are canceled due to conflicts with recovery on the read replica.


AWS OFFICIAL
AWS OFFICIALUpdated 2 years ago