DMS Serverless replication instance can't be started, reloaded or resumed

I had a configured replication that I used successfully last month. Reloading it today fails without an explicit error, just "Internal failure" (the endpoints, VPC, etc. have not changed since).

I then created a new serverless replication and tried to start it: "Test connection failed for endpoint" (source). This also happens randomly in Schema Conversion, though, and goes away by itself on retry.

I tried Reload and Resume on it. The first few times the error was just "Internal failure". There were no other details for it in the CloudWatch logs or the RDS logs, even though the replication has DEBUG log levels set.

On later attempts I don't get any logged error at all, but it quickly fails with: "The target of the replication my-replication-qa failed to reload." Or: "The replication my-replication-qa failed to resume." The CloudWatch log has just:

{'replication_state':'initializing', 'message': 'Initializing the replication workflow.'}
{'replication_state':'calculating_capacity', 'message': 'Calculating workload capacity for your replication based on table mappings and source metadata.'}

How can I find the exact problem? Any tips, or other logs to check? I don't have Business Support.
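One place that may hold more detail than the console banner is the DMS API itself: `DescribeReplications` returns `Status`, `StopReason` and `FailureMessages` for each serverless replication. The sketch below only shows how such a response could be condensed. The sample response is an illustrative stand-in built from the names in this post, not real output, and there is no guarantee the API reports more than "Internal failure".

```python
def summarize_replications(response):
    """Condense a DMS describe_replications response to the fields useful for debugging."""
    rows = []
    for rep in response.get("Replications", []):
        rows.append({
            "id": rep.get("ReplicationConfigIdentifier"),
            "status": rep.get("Status"),
            "stop_reason": rep.get("StopReason"),
            "failures": rep.get("FailureMessages", []),
        })
    return rows


# Illustrative stand-in for a response (NOT real output).
sample_response = {
    "Replications": [
        {
            "ReplicationConfigIdentifier": "my-replication-qa",
            "Status": "failed",
            "FailureMessages": ["Internal failure"],
        }
    ]
}
summary = summarize_replications(sample_response)

# With credentials configured, a live response could be fetched like this:
#   import boto3
#   summary = summarize_replications(boto3.client("dms").describe_replications())
```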

  • By repeatedly creating new replications, I eventually got one that, after five "Internal failure" entries in the CloudWatch log, did move past the metadata fetch issue (it either read the metadata or assumed a default). It was assigned capacity (2) and even started, but only as a dummy with no real schema selected. I can't edit this replication to select the real schema, though: every save fails with "ServerlessInternalSettings is not a valid field." And with five or more further new replications, I couldn't get one to start again. They fail either while computing resources or earlier, at the connection test (which worked before). So buggy...

asked 10 months ago, 997 views
1 Answer

These things helped make it more stable (it had started to fail at different steps):

  • create a VPC endpoint for Secrets Manager
  • set exactly the same security groups (all of them) on the two RDS instances and on the serverless replications
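For the first item, a minimal sketch of the parameters for an interface endpoint, assuming it is created through the EC2 `create_vpc_endpoint` API. The region, VPC, subnet and security group IDs are placeholders, and the helper name is made up for illustration.

```python
def secrets_manager_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for an interface VPC endpoint to Secrets Manager."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.secretsmanager",
        "SubnetIds": list(subnet_ids),
        # Reuse the same security groups as the RDS instances and the replication.
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,
    }


# Placeholder IDs; replace them with the real ones.
params = secrets_manager_endpoint_params(
    "eu-west-1", "vpc-0example", ["subnet-0example"], ["sg-0example"]
)

# With credentials configured, the endpoint could then be created like this:
#   import boto3
#   boto3.client("ec2", region_name="eu-west-1").create_vpc_endpoint(**params)
```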

But fetching the metadata to determine how many DCUs to use still failed, so I started to suspect that the schema is too big for the default memory and disk (100 GB?), even though it had worked twice in the past.

So the next trick was to start it with selection rules covering a single table. With that, it worked out how many capacity units it wanted (2) and started to provision them. My dummy table was loaded.
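A single-table selection rule of this kind can be written as a DMS table-mapping document. This is a sketch only: the schema and table names are placeholders, and the helper name is made up for illustration.

```python
import json


def single_table_selection(schema, table):
    """Return a DMS table-mapping JSON string that includes exactly one table."""
    rules = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-single-table",
                "object-locator": {"schema-name": schema, "table-name": table},
                "rule-action": "include",
            }
        ]
    }
    return json.dumps(rules, indent=2)


# Placeholder names; replace them with the real schema and table.
mapping_json = single_table_selection("MYSCHEMA", "DUMMY_TABLE")
print(mapping_json)
```

Replacing the table name with `%` would widen the rule to every table in the schema, which is the edit that was planned for the next step.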

Then I planned to edit the rules to include all tables and use Resume or Reload. But editing the replication fails again with the error:

"Invalid settings json: ServerlessInternalSettings is not a valid field."

No such field is present in the JSON that I can see in the edit GUI, so this issue looks unfixable. It also seems there is no CLI support for serverless replications (?)

So this workaround fails.
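One thing that might still be worth trying, as an untested idea: the DMS API does have operations for serverless replications (`DescribeReplicationConfigs`, `ModifyReplicationConfig`, `StartReplication`; in the CLI, `aws dms modify-replication-config` and friends). The sketch below assumes that `ServerlessInternalSettings` shows up as a top-level key in the settings JSON returned by the API, and that removing it before saving is enough. Neither assumption is confirmed. The helper names are made up for illustration.

```python
import json


def strip_internal_settings(settings_json, field="ServerlessInternalSettings"):
    """Drop the offending top-level field from a replication settings JSON string."""
    settings = json.loads(settings_json)
    settings.pop(field, None)  # no-op if the field is absent
    return json.dumps(settings)


def update_replication_config(config_arn, settings_json, table_mappings_json):
    """Save cleaned settings and new table mappings (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    dms = boto3.client("dms")
    return dms.modify_replication_config(
        ReplicationConfigArn=config_arn,
        ReplicationSettings=strip_internal_settings(settings_json),
        TableMappings=table_mappings_json,
    )


# Made-up settings fragment, used only to show what the helper does.
sample = '{"Logging": {"EnableLogging": true}, "ServerlessInternalSettings": {}}'
cleaned = strip_internal_settings(sample)
```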

Finally, I ran awsdms_support_collector_oracle.sql to see what else it reports, and fixed all the small issues it flagged, such as missing PKs and NOT NULL BLOBs, and eliminated complicated objects such as materialized views and materialized view logs. But that didn't get it past the capacity computation step.

answered 10 months ago
  • Eventually we did get Business Support and created a case last week, but no solution or further details about what's wrong have been provided yet.
