Application Migration Service (MGN) - Replication Randomly Stalls


We're having issues with MGN replication intermittently becoming Stalled and snapshots being days old due to sync lag. More details below.

We're using MGN (agent-based) to migrate on-premises servers to AWS. We're replicating between 10 and 30 source servers at a time, and at any given time 1-5 of them are in a "Stalled" state. This happens after the servers have been replicating successfully for days, and no changes are made to the source or destination environments when it occurs. Usually this resolves on its own without any intervention on our part, and the servers start reporting as healthy again. However, it doesn't always resolve on its own, and ideally it shouldn't be occurring at all. Any idea what could be causing this or where to start troubleshooting? It's starting to impact our test and cutover procedures, as we can't always launch instances from snapshots that are hours or days old.

1 Answer

MGN replication stalling, with snapshots falling out of date as a result, can have several causes. The most likely ones are network connectivity issues between the source and destination environments, insufficient resources on the destination side (in MGN, the replication servers in the staging area), or problems with the MGN agent on the source server.

To troubleshoot, start with the MGN agent logs on an affected source server and look for errors or warnings logged around the time replication stalled. Also check resource utilization on the destination side to confirm that the replication servers have enough CPU, memory, and storage to handle the replication workload.
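As a starting point for the log check, here is a minimal Python sketch that pulls suspicious lines out of the agent log. The log path is my assumption of the default location on a Linux source server (on Windows it is under the agent's install directory), and the keyword list is only a guess at what is worth reading first, so adjust both for your environment.

```python
import re
from pathlib import Path

# Assumed default agent log location on a Linux source server; verify on your host.
AGENT_LOG = Path("/var/lib/aws-replication-agent/agent.log.0")

# Illustrative keywords only, not an official list of MGN agent messages.
PROBLEM_PATTERN = re.compile(r"error|warn|timeout|stall", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines that match PROBLEM_PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM_PATTERN.search(line)
    ]


if __name__ == "__main__":
    if AGENT_LOG.exists():
        # errors="replace" keeps the scan going if the log holds odd bytes.
        with AGENT_LOG.open(errors="replace") as handle:
            for number, text in find_problem_lines(handle):
                print(f"{number}: {text}")
    else:
        print(f"{AGENT_LOG} not found; adjust the path for this server")
```

Run it on a server that is currently stalled and on one that is healthy, then compare the output. Messages that appear only on the stalled server are the ones worth chasing.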

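Next, check network connectivity between the source and destination environments. The agent needs outbound TCP 443 to the MGN service endpoint and TCP 1500 to the replication servers in the staging area subnet, so an intermittent firewall, proxy, or VPN/Direct Connect problem on either path can stall replication. If possible, also test bandwidth between the two environments: if the available bandwidth is lower than the rate at which data changes on the source disks, lag will keep growing.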
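A quick way to test reachability from a source server is a plain TCP connect. This is a rough sketch: the targets are passed on the command line, and the hostnames and IPs in the usage line are placeholders for your own MGN regional endpoint and replication server address. It only proves that a TCP handshake completes, not that throughput is adequate.

```python
import socket
import sys


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Expect arguments of the form host:port, supplied by the person running it.
    if len(sys.argv) < 2:
        print("usage: check_ports.py host:port [host:port ...]")
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        status = "open" if can_connect(host, int(port)) else "UNREACHABLE"
        print(f"{host}:{port} {status}")
```

For example, `python check_ports.py mgn.us-east-1.amazonaws.com:443 10.0.0.10:1500`, where the region and the private IP stand in for your own values. Since the stalls are intermittent, running this on a schedule and logging the results will tell you more than a single run.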

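Note that, as far as I know, MGN replicates continuously and creates its own point-in-time snapshots, so there is no snapshot frequency setting to increase. Stale snapshots are a symptom of replication lag, so the way to get more up-to-date snapshots is to resolve whatever is causing the stall.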

You can also replicate a single server on its own and monitor its behavior, to see whether the issue affects all servers or only specific ones.
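To see which servers stall and how far behind they are, you could poll the MGN API. The sketch below assumes boto3 is installed and credentials are configured. The field names (`dataReplicationInfo`, `dataReplicationState`, `lagDuration`, `lastSnapshotDateTime`) are taken from the `DescribeSourceServers` response as I understand it, so check them against the current API reference before relying on this.

```python
import sys


def stalled_servers(items):
    """Return (server_id, lag, last_snapshot) for servers in the STALLED state."""
    result = []
    for item in items:
        info = item.get("dataReplicationInfo", {})
        if info.get("dataReplicationState") == "STALLED":
            result.append(
                (
                    item.get("sourceServerID"),
                    info.get("lagDuration"),
                    info.get("lastSnapshotDateTime"),
                )
            )
    return result


def fetch_source_servers(region):
    """Fetch all MGN source servers in a region, following nextToken pages."""
    import boto3  # imported here so the filter above works without boto3

    client = boto3.client("mgn", region_name=region)
    items, kwargs = [], {"filters": {}}
    while True:
        page = client.describe_source_servers(**kwargs)
        items.extend(page.get("items", []))
        token = page.get("nextToken")
        if not token:
            return items
        kwargs = {"filters": {}, "nextToken": token}


if __name__ == "__main__":
    # Usage: python stalled.py <region>; without a region nothing is called.
    if len(sys.argv) == 2:
        for server_id, lag, snapshot in stalled_servers(fetch_source_servers(sys.argv[1])):
            print(f"{server_id} lag={lag} last_snapshot={snapshot}")
```

Logging this output every few minutes shows whether the same servers stall repeatedly or the stalls move around, which helps separate a per-server problem (agent, disk I/O) from a shared one (network path, staging area).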

answered a year ago
