AWS DMS 3.5.1 tasks suddenly start failing without errors

Why are DMS tasks suddenly failing with the following response:

Replication task has failed. Reason: Last Error The task stopped abnormally Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE.

After stopping all DMS tasks for database maintenance, we gradually resumed them. At first everything worked fine, but then almost all tasks suddenly failed with the message above. After 6 retries, a task goes into FAILED status with the response:

Replication task has failed. Reason: Last Error Task 'TASKID' was suspended due to 6 successive unexpected failures Stop Reason FATAL_ERROR Error Level FATAL.
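
For reference, the same status fields can be read programmatically instead of from the console. The following is a minimal sketch, not what we actually ran: `summarize_failures` is a hypothetical helper, and it assumes the task dictionaries come from boto3's `describe_replication_tasks` response (keys `ReplicationTaskIdentifier`, `Status`, `StopReason`, `LastFailureMessage`).

```python
def summarize_failures(tasks):
    """Return (identifier, status, stop_reason, last_failure_message)
    for every task whose Status is "failed".

    `tasks` is expected to be the "ReplicationTasks" list from a
    describe_replication_tasks response (assumption).
    """
    rows = []
    for task in tasks:
        if task.get("Status") != "failed":
            continue
        rows.append(
            (
                task.get("ReplicationTaskIdentifier"),
                task.get("Status"),
                task.get("StopReason"),
                task.get("LastFailureMessage"),
            )
        )
    return rows


# Usage against a real account (requires boto3 and credentials;
# pagination is omitted for brevity):
#
#   import boto3
#   dms = boto3.client("dms")
#   tasks = dms.describe_replication_tasks()["ReplicationTasks"]
#   for row in summarize_failures(tasks):
#       print(row)
```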

After extensive research and troubleshooting, I'm finally out of ideas as to why this happened.

Here are the details:

DMS-Setup

Replication instance: dms.t3.medium with engine version 3.5.1.

Task types: MS SQL -> PostgreSQL and MS SQL -> S3

Task Settings:

{
    "Logging": {
        "EnableLogging": true,
        "EnableLogContext": false,
        "LogComponents": [
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "TRANSFORMATION"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "SOURCE_UNLOAD"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "IO"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "TARGET_LOAD"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "PERFORMANCE"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "SOURCE_CAPTURE"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "SORTER"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "REST_SERVER"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "VALIDATOR_EXT"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "TARGET_APPLY"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "TASK_MANAGER"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "TABLES_MANAGER"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "METADATA_MANAGER"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "FILE_FACTORY"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "COMMON"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "ADDONS"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "DATA_STRUCTURE"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "COMMUNICATION"
            },
            {
                "Severity": "LOGGER_SEVERITY_DEFAULT",
                "Id": "FILE_TRANSFER"
            }
        ],
        "CloudWatchLogGroup": "NOT-A-REAL-LOGGROUP",
        "CloudWatchLogStream": "dms-task-TASKID"
    },
    "StreamBufferSettings": {
        "StreamBufferCount": 3,
        "CtrlStreamBufferSizeInMB": 5,
        "StreamBufferSizeInMB": 8
    },
    "ErrorBehavior": {
        "FailOnNoTablesCaptured": false,
        "ApplyErrorUpdatePolicy": "LOG_ERROR",
        "FailOnTransactionConsistencyBreached": false,
        "RecoverableErrorThrottlingMax": 1800,
        "DataErrorEscalationPolicy": "SUSPEND_TABLE",
        "ApplyErrorEscalationCount": 0,
        "RecoverableErrorStopRetryAfterThrottlingMax": false,
        "RecoverableErrorThrottling": true,
        "ApplyErrorFailOnTruncationDdl": false,
        "DataTruncationErrorPolicy": "LOG_ERROR",
        "ApplyErrorInsertPolicy": "LOG_ERROR",
        "EventErrorPolicy": "IGNORE",
        "ApplyErrorEscalationPolicy": "LOG_ERROR",
        "RecoverableErrorCount": -1,
        "DataErrorEscalationCount": 0,
        "TableErrorEscalationPolicy": "STOP_TASK",
        "RecoverableErrorInterval": 5,
        "ApplyErrorDeletePolicy": "IGNORE_RECORD",
        "TableErrorEscalationCount": 0,
        "FullLoadIgnoreConflicts": true,
        "DataErrorPolicy": "LOG_ERROR",
        "TableErrorPolicy": "SUSPEND_TABLE"
    },
    "TTSettings": {
        "TTS3Settings": null,
        "TTRecordSettings": null,
        "EnableTT": false
    },
    "FullLoadSettings": {
        "CommitRate": 10000,
        "StopTaskCachedChangesApplied": false,
        "StopTaskCachedChangesNotApplied": false,
        "MaxFullLoadSubTasks": 8,
        "TransactionConsistencyTimeout": 600,
        "CreatePkAfterFullLoad": false,
        "TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD"
    },
    "TargetMetadata": {
        "ParallelApplyBufferSize": 0,
        "ParallelApplyQueuesPerThread": 0,
        "ParallelApplyThreads": 0,
        "TargetSchema": "",
        "InlineLobMaxSize": 0,
        "ParallelLoadQueuesPerThread": 0,
        "SupportLobs": true,
        "LobChunkSize": 0,
        "TaskRecoveryTableEnabled": false,
        "ParallelLoadThreads": 0,
        "LobMaxSize": 32,
        "BatchApplyEnabled": false,
        "FullLobMode": false,
        "LimitedSizeLobMode": true,
        "LoadMaxFileSize": 0,
        "ParallelLoadBufferSize": 0
    },
    "BeforeImageSettings": null,
    "ControlTablesSettings": {
        "historyTimeslotInMinutes": 10,
        "HistoryTimeslotInMinutes": 10,
        "StatusTableEnabled": true,
        "SuspendedTablesTableEnabled": true,
        "HistoryTableEnabled": true,
        "ControlSchema": "",
        "FullLoadExceptionTableEnabled": false
    },
    "LoopbackPreventionSettings": null,
    "CharacterSetSettings": null,
    "FailTaskWhenCleanTaskResourceFailed": false,
    "ChangeProcessingTuning": {
        "StatementCacheSize": 50,
        "CommitTimeout": 1,
        "BatchApplyPreserveTransaction": true,
        "BatchApplyTimeoutMin": 1,
        "BatchSplitSize": 0,
        "BatchApplyTimeoutMax": 30,
        "MinTransactionSize": 1000,
        "MemoryKeepTime": 60,
        "BatchApplyMemoryLimit": 500,
        "MemoryLimitTotal": 1024
    },
    "ChangeProcessingDdlHandlingPolicy": {
        "HandleSourceTableDropped": true,
        "HandleSourceTableTruncated": true,
        "HandleSourceTableAltered": true
    },
    "PostProcessingRules": null
}
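
To make the recoverable-error settings above easier to reason about, here is a rough sketch of the wait times they would imply between restart attempts. It assumes the behavior described in the DMS task settings documentation (with `RecoverableErrorThrottling` enabled, the wait grows in a doubling series based on `RecoverableErrorInterval`). That the first wait equals the interval, and that `RecoverableErrorThrottlingMax` acts as a cap in seconds, are assumptions on my part. `retry_schedule` is a hypothetical helper, not a DMS API.

```python
def retry_schedule(error_behavior, attempts):
    """Return the assumed wait (in seconds) before each restart attempt.

    Assumption: with throttling on, waits double each time, starting at
    RecoverableErrorInterval and capped at RecoverableErrorThrottlingMax.
    With throttling off, every wait equals RecoverableErrorInterval.
    """
    interval = error_behavior["RecoverableErrorInterval"]
    throttling = error_behavior["RecoverableErrorThrottling"]
    cap = error_behavior["RecoverableErrorThrottlingMax"]

    waits = []
    wait = interval
    for _ in range(attempts):
        if throttling:
            waits.append(min(wait, cap))
            wait *= 2
        else:
            waits.append(interval)
    return waits


# Values copied from the "ErrorBehavior" section of the task settings.
settings = {
    "RecoverableErrorInterval": 5,
    "RecoverableErrorThrottling": True,
    "RecoverableErrorThrottlingMax": 1800,
}
print(retry_schedule(settings, 6))
```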

What we tried:

  1. Increasing the replication instance size to dms.c6i.4xlarge
  2. Different replication instances
  3. MaxFullLoadSubTasks between 1-8
  4. Creating a completely new replication instance and tasks
  5. Almost everything in the AWS DMS Troubleshooting Guide https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html
  6. Detailed Debug logging on all categories
  7. Stopping all tasks and slowly starting them one by one once the previous task was in CDC mode
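
Step 7 can be scripted roughly as follows. This is a sketch under assumptions rather than the exact procedure we used: `resume_one_by_one` and `full_load_finished` are hypothetical helpers, `dms` is expected to be a boto3 DMS client, and "in CDC mode" is approximated as status `running` with `FullLoadProgressPercent` at 100.

```python
import time


def full_load_finished(task):
    """Approximate 'task is in CDC mode' (assumption): the task is
    running and its full-load progress has reached 100 percent."""
    stats = task.get("ReplicationTaskStats", {})
    return (
        task.get("Status") == "running"
        and stats.get("FullLoadProgressPercent") == 100
    )


def resume_one_by_one(dms, task_arns, poll_seconds=30, sleep=time.sleep):
    """Resume each task in turn, waiting until the previous one has
    finished its full load before starting the next."""
    for arn in task_arns:
        dms.start_replication_task(
            ReplicationTaskArn=arn,
            StartReplicationTaskType="resume-processing",
        )
        while True:
            resp = dms.describe_replication_tasks(
                Filters=[{"Name": "replication-task-arn", "Values": [arn]}]
            )
            task = resp["ReplicationTasks"][0]
            if task.get("Status") == "failed":
                raise RuntimeError(
                    f"{arn} failed: {task.get('LastFailureMessage')}"
                )
            if full_load_finished(task):
                break
            sleep(poll_seconds)


# Usage (requires boto3 and credentials; the ARNs are placeholders):
#
#   import boto3
#   resume_one_by_one(boto3.client("dms"), ["arn:aws:dms:...:task:..."])
```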

Observations:

  • No errors in the logs (even with detailed debug enabled)
  • Tasks that ran for hours suddenly fail, then more and more follow
  • Once a task fails with the error above, it keeps failing until it gets suspended
  • Not all tasks are failing - Other tasks using the same source and target endpoints are running without issues
  • The instance size doesn't seem to be the problem
  • All tasks were running smoothly before the database server maintenance (This was a forced EC2 reboot from AWS)
  • No settings were changed on the source or the target side
  • No schema changes on the source or the target
  • No signs of issues on the metrics of the source, target, or replication instance
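
The "no errors in the logs" check can be repeated with a small filter like the one below. It is only a sketch: `problem_lines` is a hypothetical helper, and it assumes DMS task log lines mark severity right after the bracketed component name (for example `]E:` for errors and `]W:` for warnings). The sample lines are made up for illustration.

```python
import re

# Assumed DMS log format: "... [COMPONENT ]E:  message" for errors,
# "]W:" for warnings, "]I:" for informational lines.
_LEVEL = re.compile(r"\]([EW]):")


def problem_lines(messages):
    """Return only the log messages flagged as error or warning."""
    return [m for m in messages if _LEVEL.search(m)]


# Made-up sample lines, not taken from our actual logs.
sample = [
    "2023-01-01T00:00:00 [TASK_MANAGER    ]I:  Task is running",
    "2023-01-01T00:00:05 [SOURCE_CAPTURE  ]E:  example error text",
]
print(problem_lines(sample))

# Usage against CloudWatch Logs (requires boto3 and credentials):
#
#   import boto3
#   logs = boto3.client("logs")
#   events = logs.filter_log_events(
#       logGroupName="NOT-A-REAL-LOGGROUP",
#       logStreamNames=["dms-task-TASKID"],
#   )["events"]
#   print(problem_lines(e["message"] for e in events))
```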

We are running out of ideas and have tried everything we could think of. Nothing in the troubleshooting guide helped. It looks like an issue with the replication instance, but then why are other tasks on that instance not affected? The same can be said for the source and the target. Since nothing else changed, why is this suddenly happening?

Any help would be highly appreciated. Thanks in advance for any response here.

Update: Rolling back to engine version 3.4.7 seems to work fine. We have already reported this to AWS Support.

IT-BABA
asked 8 months ago, 146 views
No Answers
