DMS Random Termination


I have set up a CDC replication task using AWS Database Migration Service (DMS) that captures changes from a Postgres database and writes them to Kinesis. This generally works fine, but the DMS task seems to terminate randomly after some time (mainly during the night) and is then unable to resume. An example log of such a "random" termination:

2023-07-30T22:01:43 [SOURCE_CAPTURE  ]I:  Heartbeat was signaled successfully  (postgres_endpoint_util.c:1785)
2023-07-30T22:06:43 [SOURCE_CAPTURE  ]I:  Heartbeat was signaled successfully  (postgres_endpoint_util.c:1785)
2023-07-30T22:24:34 [AT_GLOBAL       ]I:  Task Server Log - ais2-pdp-cdc-task  (V3.4.7.R905 ip-172-23-0-99 Linux 4.14.320-242.534.amzn2.x86_64 #1 SMP Wed Jul 12 19:43:51 UTC 2023 x86_64 64-bit, PID: 1311159) started at Sun Jul 30 22:24:34 2023  (at_logger.c:3051)
2023-07-30T22:24:34 [DATA_STRUCTURE  ]I:  SQLite version is 3.31.1  (at_sqlite.c:174)
2023-07-30T22:24:34 [VALIDATOR       ]I:  validation_util_class_initialize  (validation_util.c:71)
2023-07-30T22:24:34 [VALIDATOR       ]I:  Creating Table Def Mutex  (validation_util.c:75)

Note that between 22:06:43 and 22:24:34 nothing at all seems to be running, not even the heartbeat that should be signaled every 5 minutes.
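To find such silent periods across a longer log, here is a minimal Python sketch that scans the task log for consecutive timestamps further apart than the heartbeat interval. The function name `find_gaps` and the 30-second tolerance are my own assumptions for illustration. It only assumes that each log line starts with a timestamp in the format shown above.

```python
from datetime import datetime, timedelta

# Timestamp format at the start of each DMS log line shown above.
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S"


def find_gaps(log_lines, max_gap=timedelta(minutes=5, seconds=30)):
    """Return (previous, current) timestamp pairs that are further apart than max_gap.

    max_gap defaults to the 5-minute heartbeat interval plus an assumed
    30-second tolerance.
    """
    gaps = []
    previous = None
    for line in log_lines:
        token = line.strip().split(" ", 1)[0]
        try:
            current = datetime.strptime(token, LOG_TS_FORMAT)
        except ValueError:
            continue  # skip blank lines and lines without a leading timestamp
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


sample = [
    "2023-07-30T22:01:43 [SOURCE_CAPTURE  ]I:  Heartbeat was signaled successfully  (postgres_endpoint_util.c:1785)",
    "2023-07-30T22:06:43 [SOURCE_CAPTURE  ]I:  Heartbeat was signaled successfully  (postgres_endpoint_util.c:1785)",
    "2023-07-30T22:24:34 [DATA_STRUCTURE  ]I:  SQLite version is 3.31.1  (at_sqlite.c:174)",
]

for start, end in find_gaps(sample):
    print(f"no log output between {start} and {end} ({end - start})")
```

Run against the full task log, this should list every window in which the task was silent, which can then be compared with the replication instance metrics for the same period.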

Further down in the log, DMS appears to try to resume the task but fails because the replication slot is still held by the earlier, uncleanly terminated task:

2023-07-30T22:24:36 [SOURCE_CAPTURE  ]I:  Slot has plugin 'test_decoding'  (postgres_test_decoding.c:237)
2023-07-30T22:24:36 [SOURCE_CAPTURE  ]E:  Slot 'ais2_pdp_cdc_tas_00025600_ed91b868_07df_42c6_941d_2ebb04d30481' state found as 'already active' while expected as 'inactive'. [1020461]  (postgres_endpoint_capture.c:355)
2023-07-30T22:24:36 [TASK_MANAGER    ]I:  Task - ais2-pdp-cdc-task is in ERROR state, updating starting status to AR_NOT_APPLICABLE  (repository.c:5102)
2023-07-30T22:24:36 [SOURCE_CAPTURE  ]E:  Error executing source loop [1020461]  (streamcomponent.c:1873)
2023-07-30T22:24:36 [TASK_MANAGER    ]E:  Stream component failed at subtask 0, component st_0_VPO5NKIVXTDIZSNUG75H5RV2POSZ7O3FGQ4VQFY [1020461]  (subtask.c:1414)
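To see who is holding the slot when this happens, a minimal sketch along these lines could query the `pg_replication_slots` view on the source database. The helper names `fetch_slot` and `describe_slot` are hypothetical, and the `%s` placeholder assumes a DB-API driver such as psycopg2. I have not confirmed that this explains the failure; it is only a diagnostic aid.

```python
# Columns taken from PostgreSQL's pg_replication_slots system view.
SLOT_QUERY = (
    "SELECT slot_name, plugin, active, active_pid "
    "FROM pg_replication_slots WHERE slot_name = %s"
)


def fetch_slot(conn, slot_name):
    """Run SLOT_QUERY on a DB-API connection and return one row or None.

    Assumes a driver that uses the %s parameter style (psycopg2, for example).
    """
    cur = conn.cursor()
    try:
        cur.execute(SLOT_QUERY, (slot_name,))
        return cur.fetchone()
    finally:
        cur.close()


def describe_slot(row):
    """Turn a (slot_name, plugin, active, active_pid) row into a readable status."""
    if row is None:
        return "slot not found"
    slot_name, plugin, active, active_pid = row
    if active:
        return f"slot {slot_name} ({plugin}) is held by backend pid {active_pid}"
    return f"slot {slot_name} ({plugin}) is inactive"
```

With a real connection, `describe_slot(fetch_slot(conn, "ais2_pdp_cdc_tas_00025600_ed91b868_07df_42c6_941d_2ebb04d30481"))` would report whether the slot is still marked active and by which backend. If a stale backend from the old task still holds it, terminating that backend with `pg_terminate_backend(active_pid)` might release the slot, but that is an assumption to verify before trying it on a production database.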

I could not find anyone else with a similar problem. Is this a known issue? Has anyone used DMS successfully for change data capture from Postgres to Kinesis?

Below I am including some DMS Task configuration details that might be relevant for this issue:

 "StreamBufferSettings": {
        "StreamBufferCount": 3,
        "CtrlStreamBufferSizeInMB": 5,
        "StreamBufferSizeInMB": 8
    },
    "ErrorBehavior": {
        "FailOnNoTablesCaptured": true,
        "ApplyErrorUpdatePolicy": "LOG_ERROR",
        "FailOnTransactionConsistencyBreached": false,
        "RecoverableErrorThrottlingMax": 1800,
        "DataErrorEscalationPolicy": "SUSPEND_TABLE",
        "ApplyErrorEscalationCount": 0,
        "RecoverableErrorStopRetryAfterThrottlingMax": false,
        "RecoverableErrorThrottling": true,
        "ApplyErrorFailOnTruncationDdl": false,
        "DataTruncationErrorPolicy": "LOG_ERROR",
        "ApplyErrorInsertPolicy": "LOG_ERROR",
        "EventErrorPolicy": "IGNORE",
        "ApplyErrorEscalationPolicy": "LOG_ERROR",
        "RecoverableErrorCount": -1,
        "DataErrorEscalationCount": 0,
        "TableErrorEscalationPolicy": "STOP_TASK",
        "RecoverableErrorInterval": 5,
        "ApplyErrorDeletePolicy": "IGNORE_RECORD",
        "TableErrorEscalationCount": 0,
        "FullLoadIgnoreConflicts": true,
        "DataErrorPolicy": "LOG_ERROR",
        "TableErrorPolicy": "SUSPEND_TABLE"
    },
    "TTSettings": {
        "TTS3Settings": null,
        "TTRecordSettings": null,
        "EnableTT": false
    },
    "FullLoadSettings": {
        "CommitRate": 10000,
        "StopTaskCachedChangesApplied": false,
        "StopTaskCachedChangesNotApplied": false,
        "MaxFullLoadSubTasks": 8,
        "TransactionConsistencyTimeout": 600,
        "CreatePkAfterFullLoad": false,
        "TargetTablePrepMode": "DROP_AND_CREATE"
    },
    "TargetMetadata": {
        "ParallelApplyBufferSize": 100,
        "ParallelApplyQueuesPerThread": 1,
        "ParallelApplyThreads": 0,
        "TargetSchema": "",
        "InlineLobMaxSize": 0,
        "ParallelLoadQueuesPerThread": 1,
        "SupportLobs": true,
        "LobChunkSize": 64,
        "TaskRecoveryTableEnabled": false,
        "ParallelLoadThreads": 0,
        "LobMaxSize": 200,
        "BatchApplyEnabled": false,
        "FullLobMode": false,
        "LimitedSizeLobMode": true,
        "LoadMaxFileSize": 0,
        "ParallelLoadBufferSize": 0
    },
    "BeforeImageSettings": {
        "EnableBeforeImage": true,
        "ColumnFilter": "all",
        "FieldName": "before-image"
    },
    "ControlTablesSettings": {
        "historyTimeslotInMinutes": 5,
        "HistoryTimeslotInMinutes": 5,
        "StatusTableEnabled": false,
        "SuspendedTablesTableEnabled": false,
        "HistoryTableEnabled": false,
        "ControlSchema": "",
        "FullLoadExceptionTableEnabled": false
    },
    "LoopbackPreventionSettings": null,
    "CharacterSetSettings": null,
    "FailTaskWhenCleanTaskResourceFailed": false,
    "ChangeProcessingTuning": {
        "StatementCacheSize": 50,
        "CommitTimeout": 1,
        "BatchApplyPreserveTransaction": true,
        "BatchApplyTimeoutMin": 1,
        "BatchSplitSize": 0,
        "BatchApplyTimeoutMax": 30,
        "MinTransactionSize": 1000,
        "MemoryKeepTime": 60,
        "BatchApplyMemoryLimit": 500,
        "MemoryLimitTotal": 1024
    },
    "ChangeProcessingDdlHandlingPolicy": {
        "HandleSourceTableDropped": true,
        "HandleSourceTableTruncated": true,
        "HandleSourceTableAltered": true
    },
    "PostProcessingRules": null