DMS Serverless Migration Errors on scale-up migrating to Aurora Postgres


I'm using a DMS Serverless migration (full load and change data capture) to move from Oracle on-prem to Aurora Postgres. After a bunch of tuning, I've got it close to ready. However, I noticed that immediately after the serverless replication attempts to scale up (about 30 minutes into the migration), the task goes into a "Running with Errors" status.

[Screenshot: task status showing "Scaling Up", then "Running with errors"]

Looking at the Aurora PG logs, I see that it's running into "duplicate key value violates unique constraint" errors in the target PG DB. I didn't see this error using DMS instances, and there are definitely NOT duplicate values in the source DB. It took some time to find the failure pattern, but it's definitely immediately after scaling.

The target DB is running a db.r6g.2xlarge writer instance, and CPU was running around 60% utilization prior to the scale event. The migration is configured for 1–16 DCUs, and when it failed, it was in the process of scaling from 2 to 4 DCUs.

For now, I'm going to revert to FIXED DCU to avoid scaling, but I'd REALLY like to use scaling in the future.

I'll award meaningless hypothetical bonus points if anyone can tell me how to recover from "Running with Errors" without erasing the target DB and starting the migration over again! 😬 I'm definitely concerned about forward migrations in production hitting this sort of error with no way to recover.

Thanks

  • Hello there. Oracle and Postgres are two different databases. Did you make use of a schema conversion tool?

  • @Phil, yes, sort of. SCT sort of worked. It kept hanging while applying the changes, but I was able to dump the converted SQL/DDL, which was even better since I have a GitHub Action I call to erase my DB and rebuild from scratch every time DMS craps out on me.

3 Answers

Hey Brien, looks like you have already spent quite a bit of time troubleshooting the issue. Let's look at this from a different angle. Your description indicates DMS behaves a little differently when you use serverless instead of a provisioned DMS configuration. It's worth checking the DMS logs as well. Please enable CloudWatch logging for the DMS migration task if you haven't already, and check all the log entries from just before the time of the errors in the Aurora PG logs. This should give some useful information about the root cause of the errors. Check the below document: https://repost.aws/knowledge-center/dms-task-error-status

Also, as per the below AWS document, duplicate records on the target table are expected while running full load and CDC. In your case, since a primary key is enabled on the target, DMS is erroring out. Despite these table errors, let the job run (I assume it is not failing outright), then manually compare and validate the records in the source and target tables. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html#CHAP_Troubleshooting.General.DuplicateRecords
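If you'd rather have the task tolerate re-sent rows during the full-load phase instead of suspending the tables, the task-settings JSON has knobs for this. A minimal sketch, assuming recent DMS task-settings names (verify the exact values against the task settings reference for your DMS version before relying on them):

```json
{
  "FullLoadSettings": {
    "TargetTablePrepMode": "DO_NOTHING",
    "FullLoadIgnoreConflicts": true
  },
  "ErrorBehavior": {
    "TableErrorPolicy": "SUSPEND_TABLE"
  }
}
```

`FullLoadIgnoreConflicts` tells DMS not to treat duplicate-key conflicts during full load as errors; `TableErrorPolicy: SUSPEND_TABLE` is the default behavior you're seeing (the table gets suspended rather than the whole task stopping).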

answered 2 years ago
  • I have done a fair amount of troubleshooting. I had logging on, set to debug; that's how I found the problem. There was nothing fishy in the DMS logs, but I found the corresponding logs in Aurora stating that the client (i.e., the DMS Serverless task) had dropped the connection. After scaling, it tried to send another file, and that's when it hit the duplicate rows, on all the tables that had been loading.

    You're right that it didn't stop. But it did not, and could not, resume on the "Table Error" tables. It finished loading and entered replication mode, but the errored tables were unrecoverable.

  • If you (or anyone) know how to recover a job stuck in "Running with Errors", I'd love to hear the procedure. Nothing I did could get the migration to resume on those tables. DMS just seems to drop "Table Error" tables on the floor and pretend they don't exist. No more rows ever go in, and the options to "Re-Validate" and "Reload" the tables never become available. The ONLY path forward I could find was to destroy and rebuild ALL the underlying tables and restart the migration completely from scratch. A FRIGHTENING proposition for production workloads!
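One thing worth trying from the CLI when the console's "Reload" option never appears: DMS exposes a table-reload API. For provisioned tasks it's `aws dms reload-tables`; for serverless replications there is a `reload-replication-tables` variant. A sketch with placeholder ARN and table names (assumption: your DMS version supports this call for serverless replications; check the CLI reference first):

```shell
# Ask DMS to re-run the full load for just the suspended tables
# (placeholder ARN/schema/table; data-reload forces a full data reload)
aws dms reload-replication-tables \
  --replication-config-arn arn:aws:dms:us-east-1:123456789012:replication-config:EXAMPLE \
  --tables-to-reload SchemaName=HR,TableName=EMPLOYEES \
  --reload-option data-reload
```

If the call succeeds, the named tables should re-enter the full-load phase without restarting the whole migration.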


When I had Table Errors, I eventually found the cause only in the RDS Logs & Events page. It turned out many tables couldn't load because of foreign key constraints. So I extracted all the FKs, dropped them explicitly, and recreated them afterward, and there were no more Table Errors (when starting from scratch). But I was doing full load only, without CDC (which I imagine can introduce other DB consistency errors on the target).
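The extract/drop/recreate dance above can be scripted from the Postgres catalogs. A sketch, assuming the tables live in the `public` schema; run query 1 first and save its output somewhere safe, then review and run the generated statements by hand:

```sql
-- 1) Generate (and save!) the re-create statements BEFORE dropping anything
SELECT format('ALTER TABLE %s ADD CONSTRAINT %I %s;',
              conrelid::regclass, conname, pg_get_constraintdef(oid)) AS add_fk
FROM pg_constraint
WHERE contype = 'f'
  AND connamespace = 'public'::regnamespace;

-- 2) Generate the drop statements
SELECT format('ALTER TABLE %s DROP CONSTRAINT %I;',
              conrelid::regclass, conname) AS drop_fk
FROM pg_constraint
WHERE contype = 'f'
  AND connamespace = 'public'::regnamespace;
```

`pg_get_constraintdef` returns the full FK definition, so the saved `ADD CONSTRAINT` statements recreate the constraints exactly as they were, including ON DELETE/ON UPDATE actions.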

answered a year ago

I had this same scaling "duplicate key" error (there are no duplicate keys in the data, since the source has the same PK constraint) when loading to an Aurora Serverless Postgres DB. I was able to work around it by dropping the PK constraint before the DMS task runs and re-creating the PK after. It takes some time to recreate the index for 95M rows, but the job runs overnight, so that's not a concern.
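For reference, the drop/recreate pattern looks like this (hypothetical table and column names; the recreate is slow because `ADD PRIMARY KEY` has to build the underlying unique index from scratch):

```sql
-- Before starting the DMS task: drop the PK so re-sent rows don't error the table
ALTER TABLE employees DROP CONSTRAINT employees_pkey;

-- After full load completes: deduplicate if any re-sent rows landed, then recreate
ALTER TABLE employees ADD CONSTRAINT employees_pkey PRIMARY KEY (employee_id);
```

Note that if duplicates did land during the load, the `ADD CONSTRAINT` will fail until they are removed, so it's worth checking for duplicates first.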

answered a year ago
