Global outage event
If you're experiencing issues with your AWS services, then please refer to the AWS Health Dashboard. You can find the overall status of ongoing outages, the health of AWS services, and the latest updates from AWS engineers.
How do I troubleshoot AWS DMS replication tasks that have stopped or stalled during data migration?
I want to troubleshoot my AWS Database Migration Service (DMS) replication tasks that have stopped or stalled during data migration.
Resolution
If you run an AWS DMS replication task and experience a stopped or stalled state, then review Amazon CloudWatch metrics and AWS DMS task logs to identify the reason. Run the TestConnection API operation for both source and target endpoints. And then, follow the resolution steps that match your specific error.
Troubleshoot insufficient storage on replication instance
AWS DMS reads the changes from the source faster than the source can apply the changes to the target. AWS DMS stores pending changes in memory. If you met the memory quotas, then AWS DMS stores pending changes to disk as swap files. For more information, see What are swap files and why do the files use space on my AWS DMS instance?
If your task unexpectedly stops and you have critically low FreeStorageSpace or high CDCChangeDiskSource target metrics, then you receive the following error message:
"[SORTER]W: Reading from source is paused temporarily to enhance performance and avoid storage being full on replication instance. Total storage used by swap files exceeded the limit 1048576000 bytes, please consider checking target latency"
To resolve this error, complete the following steps:
-
Stop the task, and then resume the task to clear the swap files.
Note: If you stop and then restart or resume the task, then the AWS DMS task process releases the memory and resets stuck or slow connections to the target. After the task starts, AWS DMS applies the pending changes gradually from the existing swaps files on the disk. -
Check for large or long-running uncommitted transactions on the source. Uncommitted transactions on the source force AWS DMS to hold change data in memory and swap files until you commit or roll back the transactions.
-
To identify transactions, run the following system views query on your source database.
Note: After you identify long-running transactions, coordinate with the application team to commit, roll back, or terminate the stuck sessions.
For PostgreSQL:pg_stat_activityFor MySQL:
information_schema.innodb_trxFor Oracle:
v$transactionFor SQL Server:
sys.dm_tran_active_transactions -
Run the following query for your source to check for long-running uncommitted transactions.
For MySQL:SELECT trx_id, trx_state, trx_started, TIMESTAMPDIFF(MINUTE, trx_started, NOW()) AS running_minutes, trx_rows_locked, trx_rows_modified, trx_query FROM information_schema.innodb_trx ORDER BY trx_started;For Amazon Relational Database Service (Amazon RDS):
SELECT s.sid, s.serial#, s.username, s.program, t.start_time, t.status, t.log_io, t.phy_io, ROUND((SYSDATE - TO_DATE(t.start_time, 'MM/DD/YY HH24:MI:SS')) * 24 * 60, 2) AS elapsed_minutes FROM v$session s JOIN v$transaction t ON s.taddr = t.addr ORDER BY t.start_time;For Amazon RDS for SQL Server:
SELECT s.session_id, s.login_name, s.host_name, s.program_name, s.status, t.transaction_id, t.name AS transaction_name, ta.transaction_begin_time, DATEDIFF(MINUTE, ta.transaction_begin_time, GETDATE()) AS open_minutes, ta.transaction_type, ta.transaction_state, st.text AS last_sql_text FROM sys.dm_tran_session_transactions t JOIN sys.dm_exec_sessions s ON t.session_id = s.session_id JOIN sys.dm_tran_active_transactions ta ON t.transaction_id = ta.transaction_id CROSS APPLY sys.dm_exec_sql_text(s.most_recent_sql_handle) AS st ORDER BY ta.transaction_begin_time ASC; -
Turn on the BatchApplyEnabled parameter in your task settings to group multiple changes together and apply as bulk operations to the target.
-
Increase the MemoryLimitTotal parameter. Make sure that your replication instance class has sufficient memory to support the increased quota.
-
Increase the MemoryKeepTime parameter.
-
Identify target latency. If the target database is slow to accept changes, then AWS DMS accumulates a backlog. Use the CDCLatencyTarget metric to monitor target apply latency.
-
If swap files consume available disk space, then increase replication instance storage.
-
Turn off detailed debug logging. If you activated detailed debug logging, then excessive logs can consume storage space over time and slow down the task.
Troubleshoot PostgreSQL replication slot issues
Follow the resolution steps that match your specific error.
Replication slot is active error
If your task stalls during change data capture (CDC) or you have an increase in pending write-ahead logs (WAL) on your PostgreSQL source, then you might receive the following error message:
"ERROR: replication slot "dms_slot" is active for PID [number]"
The error occurs because the AWS DMS task doesn't consume WAL files even though the replication slot appears active.
To resolve this error, complete the following steps:
- Connect to your PostgreSQL source database to verify that the slot is stuck.
- Run the pg_replication_slots query multiple times over several minutes to confirm that restart_lsn doesn't advance.
- To end the slot connection, use pg_terminate_backend() in your PostgreSQL source database on the active packet identifier (PID) in the replication slot query.
- Stop the task.
- Modify the task with CDC start position.
- Restart the task.
For more troubleshooting steps, see Why did my AWS DMS CDC task fail with "Error 1236" when I used MySQL as the source?
WAL segment already removed error
If restart_lsn and confirmed_lsn don't advance even if the slot shows active status, then you might receive the following error message:
"ERROR: requested WAL segment has already been removed"
The error occurs when AWS DMS requires a WAL segment but it's no longer at the source.
To resolve this error, set the wal_sender_timeout parameter to 0 at the source. For large transactions, make sure that you set the MemoryKeepTime parameter to a large enough value to keep transactions. For more information, see wal_sender_timeout parameter on the PostgreSQL website.
Or, you can restart the task from full-load, include CDC changes collected from the point the task started.
Exceeds maximum reserved size error
If the WAL files exceed the quota and the replication slot in the source database is invalidated, then you might get the following error message:
"ERROR: This slot has been invalidated because it exceeded the maximum reserved size."
To resolve this error, increase the maximum amount of WAL for a replication slot specified in the source parameter group.
For PostgreSQL 13 and later, apply the max_slot_wal_keep_size parameter on the source database.
Troubleshoot Oracle archived redo log issues
AWS DMS must access the required Oracle archived redo log files needed for CDC replication. If your CDC stalls after full-load, your task can't progress past a certain System Change Number (SCN), or you see LogMiner errors in your task logs, then you receive the following error message:
"ORA-00308: cannot open archived log"
To resolve this error for RDS for Oracle sources, complete the following steps:
-
To increase the archived log retention period, run the following command:
EXEC rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours', 24);COMMIT; -
To confirm the current configuration setting, run the following command.
EXEC rdsadmin.rdsadmin_util.show_configuration; -
To check that the required archived log exists, run the following query:
SELECT name, sequence#, first_time, next_time, archived, deleted, status FROM v$archived_log WHERE sequence# = "missing_sequence_number" ORDER BY first_time DESC; -
To check the current log mode, run the following query:
SELECT log_mode FROM v$database; -
To increase retention to 48 hours, run the following command:
EXEC rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours', 48);COMMIT;
To resolve this error for self-managed Oracle sources, complete the following steps:
-
To modify your archive log deletion policy to retain logs longer, run the following command.
Note: Make sure that you set the retain parameter for the archive logs long enough for AWS DMS to process them.CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;DMS Configuration Adjustments -
In your AWS DMS source endpoint, add the following endpoint settings (formerly extra connection attributes) to use Binary Reader to access the redo logs:
useLogminerReader=N;useBfile=Y;archivedLogDestId="dest_id"
Troubleshoot permission and authentication errors
If your task stops during CDC or the task can't read from the source or write to target, then you receive one of the following error messages:
"MySQL: ERROR: Access denied for user / You need (at least one of) the SUPER,REPLICATION CLIENT privilege(s) for this operation"
"PostgreSQL: ERROR: permission denied for table / must be superuser or replication role"
"SQL Server: ERROR: The SELECT permission was denied / User does not have permission"
"Oracle: ERROR: ORA-01031: insufficient privileges / ORA-28000: account is locked"
To resolve the error, connect to your source database with an administrator AWS account to grant permissions for your database, see the following AWS documentation for your database:
- For MySQL, see Using any MySQL-compatible database as a source for AWS DMS.
- For PostgreSQL database, see Using a PostgreSQL database as an AWS DMS source.
- For Microsoft SQL Server database, see Using a Microsoft SQL Server database as a source for AWS DMS.
- For Oracle database, see Step 1: Configure Your Oracle Source Database.
To verify your credentials, complete the following steps:
- Open the AWS DMS console.
- In the navigation pane, choose Endpoints.
- Select the endpoint.
- Choose Actions, and then choose Modify.
- Choose Save.
- To test the connection, select the endpoint, and then choose Test connection.
Troubleshoot AWS DMS Serverless scaling issues
AWS DMS Serverless must have sufficient DMS capacity units (DCUs) to scale quickly enough to handle traffic spikes. If your task stalls during high-traffic periods, scaling takes several minutes, or task stops during scale-up operations, then you receive the following error message:
"Last Error Not enough memory to allocate. Stop Reason FATAL_ERROR Error Level FATAL"
To resolve this error, increase the MinCapacityUnits parameter for the DCU capacity for the AWS DMS serverless replications. Then, resume the task. To pre-scale for peak periods, increase DCUs before expected high-traffic times, and then use AWS Lambda with Amazon EventBridge to automate pre-scaling.
If your workload is consistent and predictable, then create a standard replication instance instead of serverless instance for more stable performance.
For more information, see How do I troubleshoot an AWS DMS "last error replication task out of memory" error?
Related Information
Troubleshooting migration tasks in AWS Database Migration Service
- Language
- English
