How do I troubleshoot replication lag or a backlog on my Windows source server for Application Migration Service?

I see a lag or backlog in my Windows source server when I use AWS Application Migration Service to replicate data.

Short description

You experience lag and backlog when you replicate data for the following reasons:

  • A slow network connection prevented the replication process from completing, or limited bandwidth restricted the amount of data that you can replicate.
  • Large spikes in new disk data caused a backlog that the AWS Replication Agent must send with the initial sync.
  • High read latency on the source server disks delayed disk replication.
  • High CPU, memory, I/O wait, or other resource usage caused replication bottlenecks.
  • You chose Amazon Elastic Block Store (Amazon EBS) staging volumes with low throughput or input/output operations per second (IOPS) and servers with limited network bandwidth. This causes latency and performance issues during replication.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Check the source server

Verify the source server status

Make sure that the source server for the migration is booted and running.

Verify that AWS Replication Agent processes are running

To list the running AWS Replication Agent services, run the following command from PowerShell:

get-service | where-object name -like "*AWSR*"

In the output, verify that AWSReplicationService is Running.

Example output:

PS C:\Users\Administrator> get-service | where-object name -like "*AWSR*"

Status   Name               DisplayName
------   ----               -----------
Running  AwsReplicationD... AwsReplicationDriverLogger
Running  AwsReplicationL... AwsReplicationLogger
Stopped  AwsReplicationP... AwsReplicationPostConvertService
Running  AwsReplicationS... AwsReplicationService
Running  AwsReplicationV... AwsReplicationVolumeUpdaterService

Or, press Windows + R, and then enter services.msc. Press Enter, and then verify that AWSReplicationService is Running.

Verify active TCP connections

Verify that there are five active TCP connections established with the replication server on TCP port 1500.

To check TCP port 1500, open Command Prompt (CMD) as an administrator, and then run the following command:

netstat -an | find "1500"

Check the command output for the active connections.

Example output:

TCP    172.31.82.135:50929    Replicator Instance IP:1500    ESTABLISHED
TCP    172.31.82.135:50930    Replicator Instance IP:1500    ESTABLISHED
TCP    172.31.82.135:50931    Replicator Instance IP:1500    ESTABLISHED
TCP    172.31.82.135:50933    Replicator Instance IP:1500    ESTABLISHED
TCP    172.31.82.135:50934    Replicator Instance IP:1500    ESTABLISHED
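The connection count can also be checked programmatically. The following Python sketch is illustrative only: it assumes that you saved the netstat -an output as text, and the function name count_established and the 10.0.0.25 replication server address are hypothetical. It matches the remote port exactly, so it doesn't count unrelated ports such as 15000 that a plain text search for "1500" also matches.

```python
def count_established(netstat_output: str, port: int = 1500) -> int:
    """Count ESTABLISHED TCP connections whose remote endpoint uses `port`."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected "netstat -an" layout: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        remote, state = fields[2], fields[3]
        # Compare only the port part of the foreign address
        if remote.rsplit(":", 1)[-1] == str(port) and state == "ESTABLISHED":
            count += 1
    return count


# Hypothetical sample; 10.0.0.25 stands in for the replication server IP address.
sample = """
TCP    172.31.82.135:50929    10.0.0.25:1500     ESTABLISHED
TCP    172.31.82.135:50930    10.0.0.25:1500     ESTABLISHED
TCP    172.31.82.135:50940    10.0.0.40:15000    ESTABLISHED
TCP    0.0.0.0:1500           0.0.0.0:0          LISTENING
"""
print(count_established(sample))
```

If the count is lower than expected, then check the agent services and the firewall rules for TCP port 1500.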

Use Windows Resource Monitor to check the performance on the source server

The AWS Replication Agent operates on one CPU core at a time. If CPU usage is high on the core where the AWS Replication Agent is running, then data replication slows. To check your CPU usage, complete the following steps:

  1. Open the Task Manager, and then choose the Performance tab. Then, choose Open Resource Monitor.
    -or-
    Open the Control Panel, and then choose Administrative Tools. Then, choose Resource Monitor.
    -or-
    Run resmon.exe from the command line or PowerShell.
    -or-
    Choose the Windows icon, and then enter resmon.exe.
  2. Check the CPU usage of the CPU core that the AWS Replication Agent is running on.
    If the CPU usage is high on that core, then investigate the process that consumes most of the CPU. If the agent uses at least 5% of the CPU, then verify that there's enough CPU available for the agent to perform the data replication.
  3. Check disk performance on the source server. Under Disk Activity, check the Read (B/sec) and Response Time metrics.
    If there's low read throughput on the source disk, then the agent reads and replicates less data. Note any increase in the disk read and disk write metrics.
    Note: The required bandwidth to transfer replicated data over TCP port 1500 is based on the write speed of the participating source server. It's a best practice to have a bandwidth that's at least the sum of the average write speed of all replicated source machines.
  4. Check the source server for a spike in write operations. Under Disk Activity, check the Write (B/sec) metric.
    As the workload changes, periodically check the disk performance to determine the I/O load. If the write throughput exceeds the available network throughput, then you experience replication lag.
  5. (Optional) Calculate the required bandwidth from the source server to the replication server.
    Note: If your source server is write heavy and writes more than the replication speed, then the backlog continues to grow.
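The bandwidth calculation in step 5 can be sketched as follows. This Python example is a rough estimate, not an official formula: it assumes that you noted each server's average Write (B/sec) value from Resource Monitor, and the function names and sample numbers are hypothetical. It sums the write speeds, converts bytes per second to megabits per second, and estimates how fast a backlog grows when writes outpace replication.

```python
def required_bandwidth_mbps(avg_write_bytes_per_sec):
    """Sum the average write speeds (bytes/sec) and convert to megabits/sec."""
    return sum(avg_write_bytes_per_sec) * 8 / 1_000_000


def backlog_growth_gb_per_hour(write_bytes_per_sec, replication_bytes_per_sec):
    """Estimate backlog growth in GB/hour; returns 0 when replication keeps up."""
    surplus = max(0, write_bytes_per_sec - replication_bytes_per_sec)
    return surplus * 3600 / 1_000_000_000


# Hypothetical average write speeds for three replicated source servers.
writes = [5_000_000, 2_500_000, 1_000_000]
print(required_bandwidth_mbps(writes))                       # megabits per second
print(backlog_growth_gb_per_hour(10_000_000, 6_000_000))     # GB per hour
```

The estimate ignores protocol overhead and compression, so treat the result as a lower bound and not a sizing guarantee.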

Check replication speed and available bandwidth from source server to the staging area subnet

For information about how to run a speed test, see How can I perform an SSL connectivity and bandwidth test?

Check for a source server that shut down ungracefully

If a source server shuts down ungracefully, then the AWS Replication Agent rescans all the disks after the server reboots. As the AWS Replication Agent rereads the disks, the lag continuously grows until the agent completes the scan. For more information, see Which Windows and Linux OSs support no-rescan upon reboot?

To check how the source machine shut down, complete the following steps:

  1. Press Windows + R, and then enter eventvwr.msc.
  2. Press Enter.
  3. In the navigation pane, double-click Windows Logs to expand the options.
  4. Open the context (right-click) menu for System.
  5. Choose Filter Current Log.
  6. Choose the Event sources down arrow, and then choose USER32.
  7. For All Event IDs, enter 1074, and then choose OK. Now, the Event Viewer shows you a list of power off (shutdown) and restart Shutdown Type events.
  8. To see the dates and times of all unexpected computer shutdowns, enter 6008 in the All Event IDs field, and then choose OK.

Verify that you didn't block outbound TCP port 1500 traffic

To confirm that outbound TCP port 1500 traffic from the source server to the replication server isn't blocked, run one of the following commands:

From CMD, run the following command:

telnet replication-subnet-IP-address 1500

From PowerShell, run the following command:

TNC replication-subnet-IP-address -port 1500

Note: Replace replication-subnet-IP-address with your replicator instance IP address.
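If Telnet isn't installed and Python is available on the source server, then a short script can run the same reachability check. The following is a minimal sketch that uses only the Python standard library. The function name can_connect is hypothetical. The check confirms only that a TCP handshake completes. It doesn't confirm that the replication server accepts replication traffic.

```python
import socket


def can_connect(host: str, port: int = 1500, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Example call (replace the placeholder with your replicator instance IP address):
# print(can_connect("replication-subnet-IP-address", 1500))
```

A False result means that something between the source server and the replication server blocks or drops the connection, such as a local firewall, a corporate firewall, or a security group.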

Make sure that your local firewall allows connectivity from the source server to the replication server over TCP port 1500. To activate connectivity on the operating system (OS) firewall, complete the following steps:

  1. On the source server, open the Windows Firewall console.
  2. Choose Outbound Rules.
  3. In the Outbound Rules table, select the rule related to the remote port 1500 connection. Verify that the Enabled status is set to Yes.
  4. If the Enabled status of the rule is No, then open the context (right-click) menu for the rule. Then, select Enable Rule.

Make sure that your corporate firewall allows traffic over TCP port 1500.

Verify that bandwidth throttling is deactivated in the replication settings on the source server

Deactivate bandwidth throttling on the source server to keep enough bandwidth for data transfers from the source server to the staging area subnet. Bandwidth throttling can cause constant or stagnant lag growth because it limits the data replication from the source server to the replication server.

To check for bandwidth throttling, complete the following steps:

  1. Open the Application Migration Service console.
  2. Choose Settings.
  3. Under Data routing and throttling, select the replication template.
  4. Select Do not throttle bandwidth to allow replication to use the full available network capacity and reduce migration time.
    Note: When you select Throttle bandwidth, Application Migration Service artificially caps data transfer speeds. This creates a bottleneck that slows the replication process. Select this option only if you need to limit network usage for cost control or to protect resources for other critical applications.

Check the staging area resources

Verify that inbound TCP Port 1500 traffic isn't blocked

To confirm that the replication server security groups don't block inbound TCP port 1500 traffic, complete the following steps:

  1. Open the Amazon Elastic Compute Cloud (Amazon EC2) console.
  2. In the navigation pane, choose Security groups, and then select the security group that's attached to the replicator instance.
  3. Verify that the security group allows inbound TCP port 1500 traffic.

Analyze your staging resources

Check the replication instance and staging disk configuration for performance bottlenecks.

Check the snapshot quota in the destination Region

Make sure that your AWS account didn't exceed the snapshot quota in the replication server's AWS Region.

To check your snapshot quota in the Region, run the following get-service-quota AWS CLI command:

aws service-quotas get-service-quota --service-code ebs --quota-code L-309BACF6 --region regionexample --query "Quota.Value"  

Note: Replace regionexample with your Region.

Then, run the following describe-snapshots command to check the snapshots in the Region:

aws ec2 describe-snapshots --owner-ids self --region regionexample --query "length(Snapshots)"

Note: Replace regionexample with your Region.
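To compare the two results, you can pass the command outputs to a small helper. The following Python sketch assumes that each command printed a bare number, such as 100000.0 for the quota and an integer for the snapshot count. The function name snapshot_quota_usage and the sample values are hypothetical.

```python
def snapshot_quota_usage(quota_output: str, count_output: str) -> dict:
    """Compare the snapshot count with the quota; both inputs are raw CLI output."""
    quota = int(float(quota_output.strip()))   # the quota value can print as a float
    used = int(count_output.strip())
    return {
        "quota": quota,
        "used": used,
        "remaining": quota - used,
        "percent_used": round(100 * used / quota, 1),
    }


# Hypothetical outputs from the two commands above.
print(snapshot_quota_usage("100000.0\n", "97500\n"))
```

If the remaining value is close to zero, then delete snapshots that you no longer need or request a quota increase.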