DMS - Ongoing replication causing duplicate records due to redshift cluster maintenance

0

Hi,

I'm facing an issue with DMS ongoing replication causing a small number of duplicate records due to the target redshift cluster going down for maintenance every week. One of the error messages in the cloudwatch logs is Failed to start rollback apply transaction, so it looks like DMS fails to undo a batch, and with RecoverableErrorCount set to -1, the process ends up retrying the ongoing replication from the batch it already ran, resulting in the duplicates.

Is there a way via the DMS settings to handle this rollback failure issue caused by the cluster maintenance? Or maybe a way to work around the cluster going down for maintenance?

Thanks!

Below are my settings:

Source: Mysql Target: Redshift

{ "ErrorBehavior": { "FailOnNoTablesCaptured": true, "ApplyErrorUpdatePolicy": "SUSPEND_TABLE", "FailOnTransactionConsistencyBreached": true, "RecoverableErrorThrottlingMax": 1800, "DataErrorEscalationPolicy": "SUSPEND_TABLE", "ApplyErrorEscalationCount": 5, "RecoverableErrorStopRetryAfterThrottlingMax": true, "RecoverableErrorThrottling": true, "ApplyErrorFailOnTruncationDdl": false, "DataTruncationErrorPolicy": "SUSPEND_TABLE", "ApplyErrorInsertPolicy": "SUSPEND_TABLE", "EventErrorPolicy": "IGNORE", "ApplyErrorEscalationPolicy": "SUSPEND_TABLE", "RecoverableErrorCount": -1, "DataErrorEscalationCount": 5, "TableErrorEscalationPolicy": "SUSPEND_TABLE", "RecoverableErrorInterval": 5, "ApplyErrorDeletePolicy": "SUSPEND_TABLE", "TableErrorEscalationCount": 5, "FullLoadIgnoreConflicts": true, "DataErrorPolicy": "SUSPEND_TABLE", "TableErrorPolicy": "SUSPEND_TABLE" }, "TTSettings": { "TTS3Settings": null, "TTRecordSettings": null, "EnableTT": false }, "FullLoadSettings": { "CommitRate": 10000, "StopTaskCachedChangesApplied": false, "StopTaskCachedChangesNotApplied": false, "MaxFullLoadSubTasks": 8, "TransactionConsistencyTimeout": 1200, "CreatePkAfterFullLoad": false, "TargetTablePrepMode": "DROP_AND_CREATE" }, "TargetMetadata": { "ParallelApplyBufferSize": 10000, "ParallelApplyQueuesPerThread": 0, "ParallelApplyThreads": 32, "TargetSchema": "", "InlineLobMaxSize": 0, "ParallelLoadQueuesPerThread": 0, "SupportLobs": true, "LobChunkSize": 0, "TaskRecoveryTableEnabled": true, "ParallelLoadThreads": 32, "LobMaxSize": 63, "BatchApplyEnabled": true, "FullLobMode": false, "LimitedSizeLobMode": true, "LoadMaxFileSize": 0, "ParallelLoadBufferSize": 1000 }, "BeforeImageSettings": null, "ControlTablesSettings": { "historyTimeslotInMinutes": 5, "HistoryTimeslotInMinutes": 5, "StatusTableEnabled": true, "SuspendedTablesTableEnabled": true, "HistoryTableEnabled": true, "ControlSchema": "dms_control", "FullLoadExceptionTableEnabled": true }, "LoopbackPreventionSettings": null, "CharacterSetSettings": null, "FailTaskWhenCleanTaskResourceFailed": false, "ChangeProcessingTuning": { "StatementCacheSize": 50, "CommitTimeout": 1, "BatchApplyPreserveTransaction": true, "BatchApplyTimeoutMin": 1, "BatchSplitSize": 0, "BatchApplyTimeoutMax": 30, "MinTransactionSize": 1000, "MemoryKeepTime": 60, "BatchApplyMemoryLimit": 500, "MemoryLimitTotal": 1024 }, "ChangeProcessingDdlHandlingPolicy": { "HandleSourceTableDropped": true, "HandleSourceTableTruncated": true, "HandleSourceTableAltered": true }, "PostProcessingRules": null }

1개 답변
0
수락된 답변

Dear Customer,

A very warm greeting from AWS! I hope that you are doing good!

After going through the post, I understand that you have started to witnessed the duplicate records in the ongoing replication post redshift cluster maintenance with the error "Failed to start rollback apply transaction" in dms logs. Hence you would like to know any ways/steps to rectify the error and also a workaround to handle dms task when the redshift cluster goes for a maintenance .

I would like to inform you that duplication of records into the target redshift has already been fixed in the replication instance version 3.4.5 as per link [1]. Hence, please make sure to use the replication instance version 3.4.5 and above version.

[+] AWS DMS release notes - AWS Database Migration Service 3.4.5 release notes - https://docs.aws.amazon.com/dms/latest/userguide/CHAP_ReleaseNotes.html#CHAP_ReleaseNotes.DMS345

In case if you are already using the replication instance with 3.4.5 and above; then we require task information along with the logs to get more details on this error.

Therefore, I would kindly request you to raise a support case along with task details for further investigation using the below link.

[+] https://support.console.aws.amazon.com/support/home?#/case/create

In regards to redshift cluster maintenance, as an alternative approach ~ you may consider setting up a RedShift events notification for each maintenance start (REDSHIFT-EVENT-2003) and (REDSHIFT-EVENT-2004) to trigger a Lambda function which will automatically stop and resume the DMS Task based on the event. Please go through the links [1] and [2] for more details.

[1] Amazon Redshift event notifications - Amazon Redshift event categories and event messages - https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#redshift-event-messages

[2] Automating DMS Tasks - https://aws.amazon.com/blogs/database/automating-aws-dms-migration-tasks

Have a good day!

지원 엔지니어
답변함 2년 전
profile picture
전문가
검토됨 한 달 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠