DMS CDC task (Oracle->S3, binary reader) "hangs" without failing, misses changes
We're running DMS against an on-prem Oracle database, with S3 as the target (which we then load into Snowflake outside of DMS). After working fine for a few hours, the replication task will, at seemingly random times, simply stop processing Oracle log files. It then reports no source CDC changes and zero latency (whereas the source latency normally fluctuates between 1 and 5s). CloudWatch logs continue to show the heartbeat message (but few other messages):
[SORTER ]I: Task is running (sorter.c:736)
After turning on DEBUG logging, we're seeing the following messages that appear to be related to this problem:
2021-12-27T16:39:50:430405 [TASK_MANAGER ]D: There are 284 swap files of total size 1 Mb. Left to process 5 of size 1 Mb (replicationtask_cmd.c:1639)
We also see this error, which appears much more frequently. We're including it here, but we're not sure whether it's related:
2021-12-27T16:43:01:094411 [DATA_STRUCTURE ]E: SQLite general error. Code <19>, Message <UNIQUE constraint failed: events.identifier, events.eventType, events.detailMessage>.  (at_sqlite.c:475)
We're running a dms.r5.large, and can't otherwise find any patterns about when/why the issue appears (again, without any warnings, other errors, etc.).
Restarting the task (stopping the task takes an unusually long time) fixes the problem and causes the task to "catch up" from where the updates stalled. Our current workaround is an alert that fires when latency stays at zero for too long, which triggers a Lambda function that stops and restarts the task.
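For anyone wanting to try the same workaround, here is a minimal sketch of the restart logic in Python. It assumes a boto3 DMS client is passed in as `dms` and that the latency samples (for example from the CDCLatencySource CloudWatch metric) have already been fetched. The function names and the `min_points` threshold are hypothetical, not taken from our actual Lambda.

```python
from typing import Any, Sequence


def is_stalled(latencies: Sequence[float], min_points: int = 10) -> bool:
    """Return True when the most recent `min_points` samples are all zero."""
    if len(latencies) < min_points:
        return False  # not enough data to call it a stall
    return all(value == 0 for value in latencies[-min_points:])


def restart_task(dms: Any, task_arn: str) -> None:
    """Stop the task, wait until it has stopped, then resume CDC."""
    dms.stop_replication_task(ReplicationTaskArn=task_arn)
    # Stopping can be slow, so block until DMS reports the task as stopped.
    waiter = dms.get_waiter("replication_task_stopped")
    waiter.wait(Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}])
    # Resume from the last checkpoint rather than reloading the tables.
    dms.start_replication_task(
        ReplicationTaskArn=task_arn,
        StartReplicationTaskType="resume-processing",
    )


def check_and_restart(
    dms: Any, task_arn: str, latencies: Sequence[float], min_points: int = 10
) -> bool:
    """Restart the task only if latency has been flat at zero; report what was done."""
    if is_stalled(latencies, min_points):
        restart_task(dms, task_arn)
        return True
    return False
```

In a Lambda handler this would be called with `boto3.client("dms")` and the task ARN. Because the stop can take a long time, the wait may outlast the function timeout, so splitting the stop and the start into two invocations may be safer.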
I am having the same problem (the SQLite error) with MySQL -> DynamoDB. I saw a post (https://forums.aws.amazon.com/thread.jspa?threadID=324651) suggesting that this was "fixed" by creating a new replication instance and task with an updated version (3.3.4), but I'm already on a later version.
The SORTER is the main component in Change Data Capture (CDC) that routes the changes captured from the source to the target and is responsible for:
- Synchronizing Full Load and CDC changes
- Deciding which events to apply as cached changes
- Storing the transactions that arrive from the source database until they are committed, and sending them to the target database in the correct order (i.e. by commit time)
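To make the last point concrete, here is a toy model in Python of buffering changes per transaction and releasing them only on commit. It is purely illustrative: `ToySorter` and its methods are hypothetical names, and this is not how DMS implements the sorter internally.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


class ToySorter:
    """Buffers changes per open transaction; releases them on commit."""

    def __init__(self) -> None:
        self._open: Dict[Hashable, List[str]] = defaultdict(list)

    def on_change(self, txn_id: Hashable, change: str) -> None:
        # Hold the change until we know whether its transaction commits.
        self._open[txn_id].append(change)

    def on_commit(self, txn_id: Hashable) -> List[str]:
        # Commits arrive in log order, so emitting here yields commit order.
        return self._open.pop(txn_id, [])

    def on_rollback(self, txn_id: Hashable) -> None:
        # Rolled-back work is discarded and never reaches the target.
        self._open.pop(txn_id, None)

    def pending(self) -> int:
        """Number of transactions still waiting for a commit or rollback."""
        return len(self._open)


sorter = ToySorter()
sorter.on_change("A", "insert a1")
sorter.on_change("B", "insert b1")
sorter.on_change("A", "update a1")
applied = sorter.on_commit("B")   # B committed first, so it is applied first
applied += sorter.on_commit("A")
```

In this model, a transaction that never commits keeps its changes buffered indefinitely, which is the kind of state a debug log about pending swap files would reflect.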
The "[SORTER ]I: Task is running" message and the other messages, such as "UNIQUE constraint failed", are not enough to determine the root cause. I recommend opening a Support case if the issue persists; DMS support can work with you to debug the sorter component and determine the actual root cause of the hang, the missing changes and the latency.