How do I troubleshoot an AWS DMS task that is failing with error message "ERROR: canceling statement due to statement timeout"?

6 minute read
0

I'm migrating data to or from my on-premises PostgreSQL database using AWS Database Migration Service (AWS DMS). The AWS DMS task runs normally for a while, and then the task fails with an error. How do I troubleshoot and resolve these errors?

Short description

If the PostgreSQL database is the source of your migration task, then AWS DMS gets data from the table during the full load phase. Then, AWS DMS reads from the write-ahead logs (WALs) that are kept by the replication slot during the change data capture (CDC) phase.

If the PostgreSQL database is the target, then AWS DMS gets the data from the source and creates CSV files in the replication instance. Then, AWS DMS runs a COPY command to insert those records into the target during the full load phase.

But, during the CDC phase, AWS DMS runs the exact DML statements from the source WAL logs in transactional apply mode. For batch apply mode, AWS DMS also creates CSV files during the CDC phase. Then, it runs a COPY command to insert the net changes to the target.

When AWS DMS tries to either get data from source or put data in the target, it uses the default timeout setting of 60 seconds. If the source or target is heavily loaded or there are locks in the tables, then AWS DMS can't finish running those commands within 60 seconds. So, the task fails with an error that says "canceling statement due to statement timeout," and you see one of these entries in the log:

Messages:

"]E: RetCode: SQL_ERROR SqlState: 57014 NativeError: 1 Message: ERROR: canceling statement due to statement timeout; Error while executing the query [1022502] (ar_odbc_stmt.c:2738)"

"]E: test_decoding_create_replication_slot(...) - Unable to create slot 'lrcyli7hfxcwkair_00016402_8917165c_29f0_4070_97dd_26e78336e01b' (on execute(...) phase) [1022506] (postgres_test_decoding.c:392))"

To troubleshoot and resolve these errors, follow these steps:

  • Identify the cause of long run times for commands.
  • Increase the timeout value and check the slot creation timeout value.
  • Troubleshoot slot creation issues.

Resolution

Identify the cause of long run times for commands

To find the command that failed to run during the timeout period, review the AWS DMS task log and the table statistics section of the task. You can also find this information in the PostgreSQL error log file if the parameter log_min_error_statement is set to ERROR or a lower severity. After identifying the command that failed, you can find the failed table names. See this example error message from the PostgreSQL error log:

ERROR: canceling statement due to statement timeout 
STATEMENT: <The statement executed>"

To find locks on the associated tables, run this command in the source or target (depending where the error is appearing):

SELECT blocked_locks.pid AS blocked_pid,
 blocked_activity.usename AS blocked_user,
 blocking_locks.pid AS blocking_pid,
         blocking_activity.usename AS blocking_user, 
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks 
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks 
        ON blocking_locks.locktype = blocked_locks.locktype 
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid 
    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.GRANTED;

If you find any PIDs that are blocked, stop or terminate the blocked PID by running this command:

SELECT pg_terminate_backend(blocking_pid);

Because dead rows, or "tuples," can increase SELECT time, check for large numbers of dead rows in the source tables by running this command:

select * from pg_stat_user_tables where relname= 'table_name';

Check to see if the failed target table has primary keys or unique indexes. If the table has no primary keys or unique indexes, then a full table scan is performed during the running of any UPDATE statement. This table scan can take a long time.

Increase the timeout value

AWS DMS uses the executeTimeout extra connection attribute in both the source and target endpoints. The default value for executeTimeout is 60 seconds, so AWS DMS times out if a query takes longer than 60 seconds to run.

If the error appears in Source_Unload or Source_Capture, then set the timeout value for executeTimeout in the source. If the error appears in Target_Load or Target_Apply, set the timeout value for executeTimeout in the target. Increase the timeout value setting by following these steps:

1.    Open the AWS DMS console.

2.    Choose Endpoints from the navigation pane.

3.    Choose the PostgreSQL endpoint.

4.    Choose Actions, and select Modify.

5.    Expand the Endpoint-specific settings section.

6.    In the field for Extra connection attributes, enter this value:

executeTimeout=3600;

7.    Choose Save.

8.    From the Endpoints pane, choose the name of your PostgreSQL endpoint.

9.    From the Connections section, the Status of the endpoint changes from Testing to Successful.

You can increase (in milliseconds) the statement_timeout parameter in the PostgreSQL DB instance. The default value is 0, which turns off timeouts for any query. You can also increase the lock_timeout parameter. The default value is 0, which turns off timeouts for locks.

Troubleshoot slot creation issues

If the timeout occurred when you created the replication slot in the PostgreSQL database, then you see log entries similar to the following:

Messages

"]E: test_decoding_create_replication_slot(...) - Unable to create slot 'lrcyli7hfxcwkair_00016402_8917165c_29f0_4070_97dd_26e78336e01b' (on execute(...) phase) [1022506] (postgres_test_decoding.c:392)"

You can increase this timeout by configuring the TransactionConsistencyTimeout parameter in the Task settings section. The default value is 600 seconds.

PostgreSQL can't create the replication slot if there are any active locks in the database user tables. Check for locks by running this command:

select * from pg_locks;

Then, to test whether the error has been resolved, run this command to manually create the replication slot in the source PostgreSQL database:

select  xlog_position FROM pg_create_logical_replication_slot('<Slot name as per
    the task log>', 'test_decoding');

If the command still can't create the slot, then you might need to work with a PostgreSQL DBA to identify the bottleneck and configure your database. If the command is successful, delete the slot that you just created as a test:

select pg_drop_replication_slot(‘<slot name>');

Finally, restart your migration task.


Related information

PostgreSQL documentation for client connection defaults

Extra connection attributes when using PostgreSQL as a target for AWS DMS

Extra connection attributes when using PostgreSQL as a DMS source

Using a PostgreSQL database as an AWS DMS source

AWS OFFICIAL
AWS OFFICIALUpdated 2 years ago