Hey Brien, it looks like you've already spent quite a bit of time troubleshooting this issue. Let's look at it from a different angle. Your description indicates that DMS behaves a little differently with a serverless configuration than with a provisioned one. It's worth checking the DMS logs as well. If you haven't already, enable CloudWatch logging for the DMS migration task and review all the log entries from just before the time of the errors in the Aurora PostgreSQL logs. This should give some useful information about the root cause of the errors. Check the below document: https://repost.aws/knowledge-center/dms-task-error-status
Also, per the AWS document below, duplicate records on the target table are expected while running Full Load and CDC. In your case, since the primary key is enabled on the target, DMS errors out. Despite these table errors, let the job run (I assume it is not failing), then manually compare and validate the records in the source and target tables. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html#CHAP_Troubleshooting.General.DuplicateRecords
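One way to do that manual comparison (a sketch; table and column names here are hypothetical, substitute your own) is to run the same counts against source and target and diff the results:

```sql
-- Run against both source and target and compare the numbers.
-- "orders" is a hypothetical table name; repeat for each migrated table.
SELECT count(*) AS row_count FROM orders;

-- On the target, surface any actual duplicate keys:
-- "id" stands in for the primary-key column.
SELECT id, count(*)
FROM orders
GROUP BY id
HAVING count(*) > 1;
```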
I have done a fair amount of troubleshooting. I had logging on, set to debug; that's how I identified the problem. There was nothing suspicious in the DMS logs, but I found the corresponding entries in the Aurora logs stating that the client (i.e., the DMS Serverless task) had dropped the connection. After scaling, it tried to send another file, and that's when it hit the duplicate rows, on all the tables that had been loading.
You're right that it didn't stop. But it did not, and could not, resume on the "Table Error" tables. It finished loading and entered replication mode, but the errored tables were unrecoverable.
If you (or anyone) knows how to recover a job stuck in "Running with Errors", I'd love to hear the procedure. Nothing I did could get the migration to resume on those tables. DMS just seems to drop "Table Error" tables on the floor and pretend they don't exist. No more rows ever go in, and the options to "Re-Validate" and "Reload" the tables never become available. The ONLY path forward I could find was to destroy and rebuild ALL the underlying tables, and restart the entire migration completely from scratch. A FRIGHTENING proposition for production workloads!
When I had Table Errors, I eventually found the cause only in the RDS Logs & Events page. It turned out many tables couldn't load because of foreign key constraints. So I extracted all the FKs, dropped them (explicitly), and recreated them afterward, and there were no more Table Errors (when starting from scratch). But I was doing Full Load only, without CDC (which I imagine can introduce other DB consistency errors in the target).
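On a PostgreSQL target, the extract/drop/recreate step above can be scripted from the system catalogs. A minimal sketch (review the generated statements before executing them):

```sql
-- 1. Save the ADD CONSTRAINT statements first, so the FKs can be
--    recreated after the full load completes:
SELECT format('ALTER TABLE %s ADD CONSTRAINT %I %s;',
              conrelid::regclass, conname, pg_get_constraintdef(oid))
FROM pg_constraint
WHERE contype = 'f';   -- 'f' = foreign key constraints

-- 2. Generate the matching DROP statements to run before the load:
SELECT format('ALTER TABLE %s DROP CONSTRAINT %I;',
              conrelid::regclass, conname)
FROM pg_constraint
WHERE contype = 'f';
```

Running the saved ADD statements after the load restores the original definitions verbatim, since `pg_get_constraintdef` returns the full constraint DDL.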
I had this same scaling "duplicate key" error (there are no duplicate keys in the data, since the source has the same PK constraint) when loading into an Aurora Serverless PostgreSQL DB. I was able to work around the error by dropping the PK constraint before the DMS task runs and re-creating the PK afterward. It takes some time to recreate the index for 95M rows, but the job runs overnight, so that's not a concern.
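For anyone trying this workaround, the two steps look roughly like the following (table, constraint, and column names are hypothetical; PostgreSQL's default PK constraint name is typically `<table>_pkey`):

```sql
-- Before starting the DMS task: drop the primary key so retried
-- batches after a scaling event don't hit duplicate-key errors.
ALTER TABLE orders DROP CONSTRAINT orders_pkey;

-- After the full load completes: recreate the PK. Expect the index
-- build to take a while on large tables (~95M rows in this case).
ALTER TABLE orders ADD CONSTRAINT orders_pkey PRIMARY KEY (id);
```

Note that recreating the PK will fail if the load actually left duplicate rows behind, so deduplicate first if needed.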

Hello there. Oracle and PostgreSQL are two different databases. Did you make use of a schema conversion tool?
@Phil, yes, sort of. SCT sort of worked: it kept hanging while applying the changes, but I was able to dump the converted SQL/DDL, which was even better, since I have a GitHub Action I call to erase my DB and rebuild it from scratch every time DMS craps out on me.