DMS CDC task (Oracle->S3, binary reader) "hangs" without failing, misses changes

We're running DMS against an on-prem Oracle database, with S3 as the target (we then load the data into Snowflake outside of DMS). After working fine for a few hours, the replication task will, seemingly at random, simply stop processing Oracle log files. It then reports no source CDC changes and zero latency (source latency normally fluctuates between 1 and 5s). CloudWatch logs continue to show the heartbeat message, but few other messages: [SORTER ]I: Task is running (sorter.c:736)

After turning on DEBUG logging, we're seeing the following messages that appear to be related to this problem:

2021-12-27T16:39:50:430405 [TASK_MANAGER    ]D:  There are 284 swap files of total size 1 Mb. Left to process 5 of size 1 Mb  (replicationtask_cmd.c:1639)

We also see the following error much more frequently, so we're including it here, though we're not sure it's related:

2021-12-27T16:43:01:094411 [DATA_STRUCTURE  ]E:  SQLite general error. Code <19>, Message <UNIQUE constraint failed: events.identifier, events.eventType, events.detailMessage>. [1000506]  (at_sqlite.c:475)

We're running a dms.r5.large, and can't otherwise find any patterns about when/why the issue appears (again, without any warnings, other errors, etc.).

Restarting the task (stopping it takes an unusually long time) fixes the problem and lets the task "catch up" from the point where the updates stalled. Our current workaround is an alert that fires when latency stays at zero for too long, which triggers a Lambda function that stops and restarts the task.
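For reference, here is a minimal sketch of what that restart workaround could look like, not our exact code. The helper names (is_stalled, restart_task), the DMS_TASK_ARN environment variable, and the sample-count threshold are all assumptions made for illustration. The boto3 DMS calls used are stop_replication_task, describe_replication_tasks, and start_replication_task with the resume-processing type, so the task continues from its last checkpoint rather than reloading.

```python
import time


def is_stalled(latency_samples, min_samples=10):
    """Return True when the last `min_samples` latency readings are all zero.

    `latency_samples` is assumed to be a chronological list of latency
    values in seconds (for example, datapoints of a CDC latency metric).
    """
    recent = latency_samples[-min_samples:]
    return len(recent) >= min_samples and all(v == 0 for v in recent)


def restart_task(dms, task_arn, poll_seconds=15, timeout_seconds=840):
    """Stop the task, wait until it reports 'stopped', then resume it.

    `dms` is a boto3 DMS client (or any object with the same methods).
    """
    dms.stop_replication_task(ReplicationTaskArn=task_arn)

    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
        )
        if resp["ReplicationTasks"][0]["Status"] == "stopped":
            break
        if time.monotonic() > deadline:
            raise TimeoutError("task did not stop within the timeout")
        time.sleep(poll_seconds)

    # Resume from the last checkpoint instead of starting over.
    dms.start_replication_task(
        ReplicationTaskArn=task_arn,
        StartReplicationTaskType="resume-processing",
    )


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without boto3.
    import os
    import boto3

    # DMS_TASK_ARN is a hypothetical environment variable name.
    restart_task(boto3.client("dms"), os.environ["DMS_TASK_ARN"])
```

Because stopping the task can be slow, the default timeout is kept just under the 15-minute Lambda limit; if the stop regularly takes longer than that, splitting the stop and the resume into two invocations would probably be safer.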

1 Answer

The SORTER is the main component in Change Data Capture (CDC) that routes the changes captured from the source to the target and is responsible for:

  • Synchronizing Full Load and CDC changes
  • Deciding which events to apply as cached changes
  • Storing the transactions that arrive from the source database until they are committed, and sending them to the target database in the correct order (i.e. by commit time)
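To illustrate the last point, here is a toy sketch of the general idea of holding changes per transaction until the commit arrives and then releasing them in commit order. It is not the DMS sorter's actual implementation; the class and method names are hypothetical, and it assumes commit records are read from the log in commit order.

```python
from collections import defaultdict


class CommitOrderBuffer:
    """Toy model: buffer changes per open transaction, release on commit."""

    def __init__(self):
        # transaction id -> changes seen so far, in arrival order
        self._open = defaultdict(list)

    def change(self, txn_id, change):
        """Record a change belonging to a transaction that is still open."""
        self._open[txn_id].append(change)

    def rollback(self, txn_id):
        """Discard everything buffered for a rolled-back transaction."""
        self._open.pop(txn_id, None)

    def commit(self, txn_id):
        """Return the transaction's changes so they can be applied now."""
        return self._open.pop(txn_id, [])

    def open_transactions(self):
        """Number of transactions still waiting for a commit or rollback."""
        return len(self._open)


buf = CommitOrderBuffer()
applied = []

# Two interleaved transactions; T2 commits before T1.
buf.change("T1", "insert row 1")
buf.change("T2", "update row 7")
buf.change("T1", "insert row 2")
applied.extend(buf.commit("T2"))
applied.extend(buf.commit("T1"))
```

In this model a long-running, uncommitted transaction keeps its changes buffered indefinitely, which is why the number of open transactions is exposed.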

The [SORTER ]I: Task is running heartbeat and the other messages, such as the UNIQUE constraint failed error, are not enough to determine the root cause. I would recommend opening a Support case if the issue persists; DMS support will work with you to debug the sorter component and determine the actual root cause of the hang, the missing changes, and the latency.

AWS
Eli DOE
answered 2 years ago
