We have a task that migrates around 250 tables from an Aptible MySQL instance into Redshift that has been running fine for a month on the latest version (3.5.1.R705. It was running fine for a year on an earlier version). We were alerted by end users that data was missing from some tables. We successfully ran reloads of those tables, but as CDC went on for a couple hours, data started to go missing. Looking at the Redshift UI, we see some failed COPY commands which appear to occur in clusters, but not constantly (for instance, we haven't seen one for the past hour). Investigating Cloudwatch logs for the DMS task we see corresponding errors:
1698804134000,"2023-11-01T02:02:14:421823 [TARGET_APPLY ]D: RetCode: SQL_ERROR SqlState: XX000 NativeError: 30 Message: [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: [SQLState XX000] ERROR: exception name : , error type : 100, message: Failed to parse error payload, should retry : 0 DETAIL: ----------------------------------------------- error: exception name : , error type : 100, message: Failed to parse error payload, should retry : 0 code: 30000 context: query: 239776237 location: xen_aws_credentials_mgr.cpp:422 process: padbmaster [pid=1073930812] ----------------------------------------------- [1022500] (ar_odbc_stmt.c:5007)"
Per the logs, it looks like these errors fail for both the manifest and the data. That hints at a Redshift issue, but the cluster appears to be fine during that time with plenty of headroom on CPU and disk and no contention. However we also then see in the DMS task logs
The target state saved. The last confirmed rec. id = 24385
where the id is being advanced later in the log after failures whereas I would expect it to remain the same. It looks like multiple application threads are writing to the logs at the same time so this may just be coming from a different thread. We are not seeing any log messages indicating a RECOVERABLE_ERROR. The task error configuration is below:
"ErrorBehavior": {
"FailOnNoTablesCaptured": false,
"ApplyErrorUpdatePolicy": "LOG_ERROR",
"FailOnTransactionConsistencyBreached": false,
"RecoverableErrorThrottlingMax": 1800,
"DataErrorEscalationPolicy": "SUSPEND_TABLE",
"ApplyErrorEscalationCount": 0,
"RecoverableErrorStopRetryAfterThrottlingMax": false,
"RecoverableErrorThrottling": true,
"ApplyErrorFailOnTruncationDdl": false,
"DataTruncationErrorPolicy": "LOG_ERROR",
"ApplyErrorInsertPolicy": "LOG_ERROR",
"EventErrorPolicy": "IGNORE",
"ApplyErrorEscalationPolicy": "LOG_ERROR",
"RecoverableErrorCount": -1,
"DataErrorEscalationCount": 0,
"TableErrorEscalationPolicy": "STOP_TASK",
"RecoverableErrorInterval": 5,
"ApplyErrorDeletePolicy": "IGNORE_RECORD",
"TableErrorEscalationCount": 0,
"FullLoadIgnoreConflicts": true,
"DataErrorPolicy": "LOG_ERROR",
"TableErrorPolicy": "SUSPEND_TABLE"
}
Any insight that can be lent here would be greatly appreciated.
You should open a support case for this - https://console.aws.amazon.com/support. Support will be able to help you understand what is happening here.