DMS - Ongoing replication causing duplicate records due to redshift cluster maintenance

0

Hi,

I'm facing an issue with DMS ongoing replication causing a small number of duplicate records due to the target redshift cluster going down for maintenance every week. One of the error messages in the cloudwatch logs is Failed to start rollback apply transaction, so it looks like DMS fails to undo a batch, and with RecoverableErrorCount set to -1, the process ends up retrying the ongoing replication from the batch it already ran, resulting in the duplicates.

Is there a way via the DMS settings to handle this rollback failure issue caused by the cluster maintenance? Or maybe a way to work around the cluster going down for maintenance?

Thanks!

Below are my settings:

Source: Mysql Target: Redshift

{ "ErrorBehavior": { "FailOnNoTablesCaptured": true, "ApplyErrorUpdatePolicy": "SUSPEND_TABLE", "FailOnTransactionConsistencyBreached": true, "RecoverableErrorThrottlingMax": 1800, "DataErrorEscalationPolicy": "SUSPEND_TABLE", "ApplyErrorEscalationCount": 5, "RecoverableErrorStopRetryAfterThrottlingMax": true, "RecoverableErrorThrottling": true, "ApplyErrorFailOnTruncationDdl": false, "DataTruncationErrorPolicy": "SUSPEND_TABLE", "ApplyErrorInsertPolicy": "SUSPEND_TABLE", "EventErrorPolicy": "IGNORE", "ApplyErrorEscalationPolicy": "SUSPEND_TABLE", "RecoverableErrorCount": -1, "DataErrorEscalationCount": 5, "TableErrorEscalationPolicy": "SUSPEND_TABLE", "RecoverableErrorInterval": 5, "ApplyErrorDeletePolicy": "SUSPEND_TABLE", "TableErrorEscalationCount": 5, "FullLoadIgnoreConflicts": true, "DataErrorPolicy": "SUSPEND_TABLE", "TableErrorPolicy": "SUSPEND_TABLE" }, "TTSettings": { "TTS3Settings": null, "TTRecordSettings": null, "EnableTT": false }, "FullLoadSettings": { "CommitRate": 10000, "StopTaskCachedChangesApplied": false, "StopTaskCachedChangesNotApplied": false, "MaxFullLoadSubTasks": 8, "TransactionConsistencyTimeout": 1200, "CreatePkAfterFullLoad": false, "TargetTablePrepMode": "DROP_AND_CREATE" }, "TargetMetadata": { "ParallelApplyBufferSize": 10000, "ParallelApplyQueuesPerThread": 0, "ParallelApplyThreads": 32, "TargetSchema": "", "InlineLobMaxSize": 0, "ParallelLoadQueuesPerThread": 0, "SupportLobs": true, "LobChunkSize": 0, "TaskRecoveryTableEnabled": true, "ParallelLoadThreads": 32, "LobMaxSize": 63, "BatchApplyEnabled": true, "FullLobMode": false, "LimitedSizeLobMode": true, "LoadMaxFileSize": 0, "ParallelLoadBufferSize": 1000 }, "BeforeImageSettings": null, "ControlTablesSettings": { "historyTimeslotInMinutes": 5, "HistoryTimeslotInMinutes": 5, "StatusTableEnabled": true, "SuspendedTablesTableEnabled": true, "HistoryTableEnabled": true, "ControlSchema": "dms_control", "FullLoadExceptionTableEnabled": true }, "LoopbackPreventionSettings": null, "CharacterSetSettings": null, "FailTaskWhenCleanTaskResourceFailed": false, "ChangeProcessingTuning": { "StatementCacheSize": 50, "CommitTimeout": 1, "BatchApplyPreserveTransaction": true, "BatchApplyTimeoutMin": 1, "BatchSplitSize": 0, "BatchApplyTimeoutMax": 30, "MinTransactionSize": 1000, "MemoryKeepTime": 60, "BatchApplyMemoryLimit": 500, "MemoryLimitTotal": 1024 }, "ChangeProcessingDdlHandlingPolicy": { "HandleSourceTableDropped": true, "HandleSourceTableTruncated": true, "HandleSourceTableAltered": true }, "PostProcessingRules": null }

1 Answer
0
Accepted Answer

Dear Customer,

A very warm greeting from AWS! I hope that you are doing good!

After going through the post, I understand that you have started to witnessed the duplicate records in the ongoing replication post redshift cluster maintenance with the error "Failed to start rollback apply transaction" in dms logs. Hence you would like to know any ways/steps to rectify the error and also a workaround to handle dms task when the redshift cluster goes for a maintenance .

I would like to inform you that duplication of records into the target redshift has already been fixed in the replication instance version 3.4.5 as per link [1]. Hence, please make sure to use the replication instance version 3.4.5 and above version.

[+] AWS DMS release notes - AWS Database Migration Service 3.4.5 release notes - https://docs.aws.amazon.com/dms/latest/userguide/CHAP_ReleaseNotes.html#CHAP_ReleaseNotes.DMS345

In case if you are already using the replication instance with 3.4.5 and above; then we require task information along with the logs to get more details on this error.

Therefore, I would kindly request you to raise a support case along with task details for further investigation using the below link.

[+] https://support.console.aws.amazon.com/support/home?#/case/create

In regards to redshift cluster maintenance, as an alternative approach ~ you may consider setting up a RedShift events notification for each maintenance start (REDSHIFT-EVENT-2003) and (REDSHIFT-EVENT-2004) to trigger a Lambda function which will automatically stop and resume the DMS Task based on the event. Please go through the links [1] and [2] for more details.

[1] Amazon Redshift event notifications - Amazon Redshift event categories and event messages - https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages

[2] Automating DMS Tasks - https://aws.amazon.com/blogs/database/automating-aws-dms-migration-tasks

Have a good day!

SUPPORT ENGINEER
answered 2 years ago
profile picture
EXPERT
reviewed a month ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions