- Newest
- Most votes
- Most comments
Hi Mike so i hear you want know how does DMS handle duplicate data.
As you asks if a Full Load is stopped/paused and then it is resumed, what happens? Well the answer is that Full load is not designed to handle a resume. It either has to finish in one go or fail. So if you are going to stop/pause it while it processes, it will start all over again. This understanding can be drawn from this wording "When using a single AZ or Multi-AZ replication instance during a FULL LOAD and a failover or host replacement occurs, the full load task is expected to fail. You can restart the task from the point of failure for the remaining tables that didn't complete, or are in an error state." and the doc is here[1]https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html
With CDC, if you stop/pause it, it will make a checkpoint and when you resume it later, it will continue. When you stop a replication task, DMS sets a checkpoint so you can resume replication from the exact place in the transaction log[1]https://aws.amazon.com/about-aws/whats-new/2018/06/aws-dms-can-start-replication-anywhere-in-a-transaction-log/
Okay so now that we have the above clear, DMS task has settings to handle how it will apply data on the target. These are:
DO_NOTHING – Data and metadata of the existing target table aren't affected.
DROP_AND_CREATE – The existing table is dropped and a new table is created in its place.
TRUNCATE_BEFORE_LOAD – Data is truncated without affecting the table metadata. if a DMS tasks fails and restarts, if you have DROP_AND_CREATE or TRUNCATE_BEFORE_LOAD as your settings, it will not cause duplicate values ever. Only DO_NOTHING will cause duplicate data after pause/restart. Doc for these settings is[2]https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html
so Mike, lastly when it comes to primary keys, DMS uses them when applying data to target in a faster way. They do not guarantee uniqueness because if you have a DMS task sending duplicate records, the primary key constraint will be violated (target database will block duplicate records basically). So only way you can really block duplicate data is if they use DROP_AND_CREATE or TRUNCATE_BEFORE_LOAD as DMS task settings
You can set the error handling behavior of your replication task during change data capture (CDC) using the following setting
FullLoadIgnoreConflicts – Set this option to true to have AWS DMS ignore "zero rows affected" and "duplicates" errors when applying cached events. If set to false, AWS DMS reports all errors instead of ignoring them. The default is true.
Duplicate records occur on a target table without a primary key.
the Primary Key is required to be able to uniquely identify different row. Running a full load and CDC task can create duplicate records on target tables that don't have a primary key or unique index. To avoid duplicating records on target tables during full load and CDC tasks, make sure that target tables have a primary key or unique index.
Relevant content
- Accepted Answerasked a year ago
- asked a year ago
- asked a year ago
- AWS OFFICIALUpdated 4 years ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated 8 months ago
Thanks for the response! I've got a couple questions, but the first is in regards to this comment:
"As you asks if a Full Load is stopped/paused and then it is resumed, what happens? Well the answer is that Full load is not designed to handle a resume. It either has to finish in one go or fail. So if you are going to stop/pause it while it processes, it will start all over again. This understanding can be drawn from this wording "When using a single AZ or Multi-AZ replication instance during a FULL LOAD and a failover or host replacement occurs, the full load task is expected to fail. You can restart the task from the point of failure for the remaining tables that didn't complete, or are in an error state." and the doc is here[1]https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html"
Doesn't that statement contradict itself? You say that a Full Load Task has only one of two states - completion or failure, with no resume capability. Yet in the quote, it says "You can restart the task from the point of failure for the remaining tables that didn't complete, or are in an error state."
Any chance of clarification on that?