
How do I troubleshoot logical replication issues for an Aurora PostgreSQL-Compatible source DB cluster?


I want to troubleshoot logical replication issues for my Amazon Aurora PostgreSQL-Compatible Edition source database (DB) cluster.

Resolution

Configure replication parameters

Before you troubleshoot logical replication issues, configure the following parameters for your DB cluster parameter group:

  • Set the rds.logical_replication value to 1 to activate logical replication on your DB cluster.
  • Set a value for max_replication_slots that includes enough slots for your expected subscription count and additional table synchronization processes.
  • Set max_wal_senders to a value that supports your max_replication_slots quota and your current physical replicas.
  • Set max_logical_replication_workers to a value that supports your subscription count and the number of additional workers that the replication system requires for table synchronization.
  • Set max_worker_processes to a value that's at least one more than the max_logical_replication_workers value. For example, if max_logical_replication_workers is 25, then set max_worker_processes to 26.
  • If data copy is slow, then increase the max_sync_workers_per_subscription value to control the number of synchronization workers that process the subscription setup and new table additions.

Important: To apply the preceding changes, you must reboot your Aurora PostgreSQL-Compatible DB cluster.
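
As a rough illustration of how the preceding parameters relate to each other, the following Python sketch derives candidate values from a planned subscription and replica count. The formulas and margins (table synchronization workers per subscription, headroom for physical replicas) are assumptions based on the guidance above, not an official sizing tool; adjust them for your workload.

```python
# Hypothetical sizing helper for the parameter guidance above.
# The sync_workers_per_sub margin is an assumption; tune it for your workload.

def suggest_replication_params(subscriptions: int,
                               physical_replicas: int = 0,
                               sync_workers_per_sub: int = 2) -> dict:
    """Derive candidate DB cluster parameter values from planned counts."""
    # One slot per subscription plus headroom for table-synchronization slots.
    max_replication_slots = subscriptions * (1 + sync_workers_per_sub)
    # WAL senders must cover the logical slots plus the physical replicas.
    max_wal_senders = max_replication_slots + physical_replicas
    # Apply workers: one per subscription plus table-synchronization workers.
    max_logical_replication_workers = subscriptions * (1 + sync_workers_per_sub)
    # At least one more than max_logical_replication_workers.
    max_worker_processes = max_logical_replication_workers + 1
    return {
        "max_replication_slots": max_replication_slots,
        "max_wal_senders": max_wal_senders,
        "max_logical_replication_workers": max_logical_replication_workers,
        "max_worker_processes": max_worker_processes,
    }

print(suggest_replication_params(subscriptions=5, physical_replicas=2))
```
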

Check publication and subscription configuration

After you configure the logical replication parameters, verify that you correctly configured the publications and subscriptions. Make sure that the publication includes all the intended tables, the subscription parameters are correct, and the replication user has the appropriate permissions.

Connect to your source database, and then run the following commands.

To check your publication configuration, run the following commands:

SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables;

To check your subscription configuration, run the following commands:

SELECT * FROM pg_subscription;
SELECT * FROM pg_subscription_rel;
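
If a publication or subscription is missing or incomplete, a minimal setup looks like the following sketch. The object names, table, connection values, and user are placeholders; replace them with your own. On Amazon RDS and Aurora, the replication user needs the rds_replication role.

```sql
-- On the source (publisher). The table and user names are placeholders.
GRANT rds_replication TO repl_user;
CREATE PUBLICATION my_publication FOR TABLE public.my_table;

-- On the target (subscriber). Replace the connection values.
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=source-cluster-endpoint dbname=mydb user=repl_user password=secret'
  PUBLICATION my_publication;
```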

Confirm that the replication slots are active

Connect to your source database, and then run the following command:

SELECT slot_name, plugin, slot_type, active, restart_lsn,
       confirmed_flush_lsn, pg_current_wal_lsn(),
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots;

If slots are inactive or show excessive lag, then review the replication workers to troubleshoot the issue.
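
The pg_wal_lsn_diff() call in the preceding query returns the lag in bytes. If you export LSN values to an external monitoring tool, you can reproduce the same arithmetic outside the database. The following Python sketch shows how an LSN such as 0/6A12458 maps to an absolute byte position (standard PostgreSQL LSN encoding: a high and a low 32-bit half in hexadecimal, separated by a slash):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string 'X/Y' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(current_lsn: str, confirmed_flush_lsn: str) -> int:
    """Replicate pg_wal_lsn_diff(current_lsn, confirmed_flush_lsn)."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(confirmed_flush_lsn)

print(lag_bytes("1/6A12458", "1/6A10000"))  # 0x2458 = 9304 bytes
```
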

Confirm that the replication workers are active 

Connect to your target database, and then run the following command:

SELECT pid, state, query, wait_event, backend_type
FROM pg_stat_activity
WHERE backend_type LIKE 'logical replication%';

If there are no replication workers, then restart your subscription.

To turn off your subscription, run the following command:

ALTER SUBSCRIPTION subscription_name DISABLE;

To turn on your subscription, run the following command:

ALTER SUBSCRIPTION subscription_name ENABLE;

If there are still no replication workers after you restart the subscription, then check the PostgreSQL error logs for error messages.

Monitor replication progress and lag

Your replication might experience lag because of high transaction volume and large transactions on the publication. Or, there might be constraints on the publication or subscription.

To determine whether replication stopped or is slow, monitor the replication progress for a few intervals. 

Connect to your DB cluster, and then run the following commands.

To check the replication progress for your publication, run the following command:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
       active
FROM pg_replication_slots
WHERE slot_type = 'logical';

To check the replication progress for your subscription, run the following command:

SELECT subname, pid, received_lsn, latest_end_lsn,
       pg_size_pretty(pg_wal_lsn_diff(latest_end_lsn, received_lsn)) AS lag
FROM pg_stat_subscription;
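
To distinguish a stalled subscription from one that's slow but progressing, compare the sampled values across intervals. The following Python sketch is a hypothetical classifier over (received_lsn, lag) samples taken at regular intervals; the thresholds and sample format are assumptions for illustration:

```python
def replication_state(samples):
    """Classify replication from successive (received_lsn_bytes, lag_bytes) samples.

    Returns 'stalled' if the received LSN never advances, 'catching up' if the
    lag shrinks over the window, otherwise 'falling behind'.
    """
    lsns = [lsn for lsn, _ in samples]
    lags = [lag for _, lag in samples]
    if all(lsn == lsns[0] for lsn in lsns):
        return "stalled"
    if lags[-1] < lags[0]:
        return "catching up"
    return "falling behind"

# Example: the LSN never moves, so replication is stalled.
print(replication_state([(100, 50), (100, 80), (100, 120)]))
```
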

To minimize the replication lag, monitor the write-through cache. If the cache size isn't sufficient for your workloads, then you can manually change the rds.logical_wal_cache value. For more information, see Achieve up to 17x lower replication lag with the new write-through cache for Aurora PostgreSQL.

Monitor resource constraints

To troubleshoot resource constraints on the publisher and subscriber, turn on Enhanced Monitoring and monitor CPUUtilization, FreeableMemory, SwapUsage, and NetworkThroughput. Set up Amazon CloudWatch alarms to be notified when replication issues occur.
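
As an illustration, the following AWS CLI command creates a CloudWatch alarm on subscriber CPU. The instance identifier, threshold, and Amazon SNS topic ARN are placeholders; replace them with your own values.

```shell
# Placeholder values: replace the instance identifier, threshold, and topic ARN.
aws cloudwatch put-metric-alarm \
  --alarm-name aurora-subscriber-high-cpu \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-subscriber-instance \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts-topic
```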

For more information, see How do I identify and troubleshoot performance issues and slow-running queries in my Amazon RDS for PostgreSQL or Aurora PostgreSQL-Compatible DB instance?

Identify schema mismatches and resolve data inconsistencies

Connect to your source database and target database. Then, confirm that the replicated tables have matching columns and compatible data types. Also, make sure that the primary keys and unique constraints are consistent.

To compare table definitions, run the following command on both databases:

SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'your_table_name'
ORDER BY ordinal_position;

Note: Replace your_table_name with your table name.
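
After you run the query on both databases, you can automate the comparison of the two result sets. The following Python sketch compares (column_name, data_type) rows; the helper and sample data are hypothetical, not part of PostgreSQL:

```python
def schema_mismatches(source_cols, target_cols):
    """Compare (column_name, data_type) rows from the source and target.

    Returns a list of human-readable differences. An empty list means the
    compared attributes match.
    """
    src = dict(source_cols)
    tgt = dict(target_cols)
    problems = []
    for name in src.keys() - tgt.keys():
        problems.append(f"column {name} missing on target")
    for name in tgt.keys() - src.keys():
        problems.append(f"column {name} missing on source")
    for name in src.keys() & tgt.keys():
        if src[name] != tgt[name]:
            problems.append(
                f"column {name}: {src[name]} on source vs {tgt[name]} on target"
            )
    return problems

source = [("asset_id", "integer"), ("name", "text")]
target = [("asset_id", "integer"), ("name", "character varying")]
print(schema_mismatches(source, target))
```
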

Resolve conflicts

Native PostgreSQL logical replication can't detect data conflicts from multiple publishers or replicated modifications that conflict with data that you changed locally. If a row with the same key already exists on the target, then Aurora applies a replicated update over the local data, but a replicated insert fails and stops replication.

To identify what caused the conflict, check the PostgreSQL logs. 

The following example log shows that the replication failed because it tried to insert a record with an asset ID that already exists in the target database:

ERROR: 23505: duplicate key value violates unique constraint "asset_pkey"
DETAIL: Key (asset_id)=(7) already exists.
CONTEXT: processing remote data for replication origin "pg_32796" during message type "INSERT" for replication target relation "public.asset" in transaction 315434, finished at 0/6A12458

In this example, the replication origin is pg_32796, and the transaction finished at Log Sequence Number (LSN) 0/6A12458.

To manually fix the data, you can let replication stop at the conflict, or configure the subscription with the disable_on_error option so that the subscription turns off automatically after an error.

Or, you can review the data on the source and target to determine whether you can skip the LSN that caused the conflict. Then, use the pg_replication_origin_advance() function to skip the LSN that caused the conflict. For more information, see pg_replication_origin_advance (node_name text, lsn pg_lsn) on the PostgreSQL website.

Note: Aurora PostgreSQL-Compatible versions 15 and later support the pg_replication_origin_advance() function.

To skip the LSN, complete the following steps:

  1. Run the following SQL command to temporarily turn off the subscription:

    ALTER SUBSCRIPTION subscription_name DISABLE;

    Note: If you configured the subscription with the disable_on_error option, then the subscription automatically turns off after an error.

  2. Use the following pg_replication_origin_advance() function to advance the origin to Finish_LSN+1:

    SELECT pg_replication_origin_advance('node_name','Finish_LSN+1'::pg_lsn);

    Note: Replace node_name with your replication origin name (for example, pg_32796), and replace Finish_LSN+1 with the LSN that immediately follows the finish LSN from the error log.

  3. Run the following command to turn on the subscription:

    ALTER SUBSCRIPTION subscription_name ENABLE;

    Note: Replace subscription_name with your subscription name.
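
The Finish_LSN+1 value in step 2 is the LSN one byte past the finish LSN from the error log. For example, 0/6A12458 becomes 0/6A12459. The following Python sketch shows the arithmetic, including the carry from the low 32-bit half into the high half:

```python
def lsn_plus_one(lsn: str) -> str:
    """Return the LSN one byte past the given PostgreSQL LSN ('X/Y' hex form)."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    position = (hi << 32) + lo + 1
    return f"{position >> 32:X}/{position & 0xFFFFFFFF:X}"

print(lsn_plus_one("0/6A12458"))   # 0/6A12459
print(lsn_plus_one("0/FFFFFFFF"))  # 1/0  (carries into the high half)
```
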

To resolve several data inconsistencies, you might need to clean the target table and reconfigure the publication and subscription.

Related information

Replication with Amazon Aurora PostgreSQL

PostgreSQL Logical Replication on PostgreSQL website

Setting up logical replication for your Aurora PostgreSQL DB cluster

Monitoring Amazon Aurora metrics with Amazon CloudWatch
