This article describes the troubleshooting steps for replication errors shown on the MGN/DRS dashboard.
After the replication agent is installed, the service runs a series of replication initiation steps before it starts the data transfer.
The troubleshooting plan below covers a failure at each of these steps:
1. Replication initiation fails at step "Create security groups"
2. Replication initiation fails at step "Launch replication server"
3. Replication initiation fails at step "Authenticate with Service"
4. Replication initiation fails at step "Download replication software"
5. Replication initiation fails at step "Create staging disks"
6. Replication initiation fails at step "Attach staging disks"
7. Replication initiation fails at step "Pair replication server with Agent"
Resolution:
1. Replication initiation fails at step "Create security groups"
Error: Data replication stalled
Failed to create security group.
This error can occur when the service tries to launch a Replication Server but cannot create a security group. It can also occur when the subnet selected in Replication Settings for launching Replication Servers no longer exists.
- Check the CloudTrail event history for the CreateSecurityGroup API call.
- Check the permissions of the IAM user that was used to install the replication agent. Also check the IAM roles for MGN/DRS.
- Check that the subnet selected in Replication Settings is correct and still exists.
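The CloudTrail check above can be sketched in Python. This is a minimal sketch, not a definitive tool: it assumes the event history was exported with `aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=CreateSecurityGroup` and loaded as JSON, and it relies on CloudTrail records carrying `errorCode` and `errorMessage` only when a call failed. The function name `failed_calls` and the sample data are hypothetical.

```python
import json


def failed_calls(lookup_output, event_name):
    """Return (eventTime, errorCode, errorMessage) for failed calls to event_name.

    lookup_output is the parsed JSON printed by `aws cloudtrail lookup-events`;
    each entry stores the full record as a JSON string under "CloudTrailEvent".
    """
    failures = []
    for event in lookup_output.get("Events", []):
        record = json.loads(event["CloudTrailEvent"])
        if record.get("eventName") != event_name:
            continue
        # errorCode is present only when the API call was rejected or failed.
        if "errorCode" in record:
            failures.append((record.get("eventTime"),
                             record["errorCode"],
                             record.get("errorMessage", "")))
    return failures


# Illustrative data only; not taken from a real account.
sample = {"Events": [
    {"CloudTrailEvent": json.dumps({
        "eventName": "CreateSecurityGroup",
        "eventTime": "2024-01-01T00:00:00Z",
        "errorCode": "Client.UnauthorizedOperation",
        "errorMessage": "You are not authorized to perform this operation."})},
    {"CloudTrailEvent": json.dumps({
        "eventName": "CreateSecurityGroup",
        "eventTime": "2024-01-01T00:05:00Z"})},
]}

for when, code, message in failed_calls(sample, "CreateSecurityGroup"):
    print(when, code, message)
```

The same function can be pointed at the other API calls named in this article (RunInstances, CreateVolume, AttachVolume) by changing `event_name`.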
2. Replication initiation fails at step "Launch replication server"
Error: Data replication stalled
Failed to launch replication server.
This error occurs if the service cannot launch an EC2 instance for the Replication Server.
- Check the CloudTrail event history for the RunInstances API call and look for errors in the event details.
- Check for IAM-related issues, or for instance launch failures caused by an explicit deny in SCPs.
- Verify that the subnet and VPC are configured correctly in the account, and that the replication settings are configured correctly.
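The explicit-deny check above can be sketched as a scan of a policy document for Deny statements that name a given action. This is a simplified sketch under stated assumptions: the policy JSON has already been retrieved and parsed, and `Condition`, `Resource` and `NotAction` are deliberately not evaluated, so a match means "review this statement", not "this call is denied". The function name `explicit_denies` and the sample SCP are hypothetical.

```python
from fnmatch import fnmatchcase


def explicit_denies(policy, action):
    """Return the Deny statements in a policy document that name `action`."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # a single statement may be a bare object
        statements = [statements]
    matches = []
    for stmt in statements:
        if stmt.get("Effect") != "Deny":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # Action names are case-insensitive and may contain * and ? wildcards.
        if any(fnmatchcase(action.lower(), pattern.lower()) for pattern in actions):
            matches.append(stmt)
    return matches


# Illustrative SCP only.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "DenyLaunch", "Effect": "Deny", "Action": "ec2:Run*", "Resource": "*"},
        {"Sid": "AllowAll", "Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

print([stmt["Sid"] for stmt in explicit_denies(scp, "ec2:RunInstances")])
```

Any statement the scan returns still needs a manual read of its conditions and resources before it can be blamed for the failed launch.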
3. Replication initiation fails at step "Authenticate with Service"
Error: Data replication stalled
Authenticate with Service
This error occurs if there is a connectivity issue between the staging subnet and the DRS/MGN endpoint on port 443 (HTTPS).
- Check the network configuration for the replication server (security groups, NACLs, route tables, DNS, DHCP option set, Route 53).
- Check CloudTrail for AccessDenied errors on API calls such as SendClientLogsForMgn/SendClientLogsForDrs and GetChannelCommandsForMgn/GetChannelCommandsForDrs.
- Check the MGN/DRS VPC interface endpoints and confirm that the VPC endpoint policy allows access to them.
- Check for any permission issues caused by SCPs.
- Perform a connectivity test from the staging subnet to the endpoints.
PowerShell commands for Windows:
For MGN :
For DRS :
Linux -
For MGN :
For DRS :
[+] https://docs.aws.amazon.com/mgn/latest/ug/preparing-environments.html#Communication-TCP-443-Staging
[+] https://docs.aws.amazon.com/drs/latest/userguide/Network-Requirements.html#Communication-TCP-443-Staging
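The connectivity test from the staging subnet can also be sketched in Python, which runs the same way on Windows and Linux. This is a minimal sketch: it only checks that a TCP connection opens on port 443 and does not validate TLS or credentials. The endpoint names follow the regional pattern `<service>.<region>.amazonaws.com`, which is an assumption here; confirm the exact hostnames in the documentation linked above. The helper names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return (True, "") if a TCP connection opens, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:   # covers DNS failure, refusal and timeout
        return False, str(exc)


def service_endpoints(region):
    """Build the assumed regional MGN and DRS endpoint hostnames."""
    return [f"mgn.{region}.amazonaws.com", f"drs.{region}.amazonaws.com"]


def check_all(region, port=443):
    """Print one reachability line per assumed endpoint."""
    for host in service_endpoints(region):
        ok, reason = can_connect(host, port)
        print(host, "reachable" if ok else f"unreachable: {reason}")
```

Calling `check_all("us-east-1")` (with your own Region) from an instance in the staging subnet prints one line per endpoint. A timeout usually points at security groups, NACLs or routing, while a name-resolution error points at DNS or DHCP option set configuration.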
4. Replication initiation fails at step "Download replication software"
Error: Data replication stalled
Failed to download replication software
This error occurs if there is a connectivity issue between the staging subnet and the S3 endpoint on port 443 (HTTPS).
- Check the network configuration for the replication server (security groups, NACLs, route tables, DNS, DHCP option set, Route 53).
- Check the S3 VPC interface endpoints and confirm that the VPC endpoint policy allows access to the endpoint.
- Check whether any Service Control Policies (SCPs) could be blocking the API request.
- Perform a connectivity test from the staging subnet to the endpoints.
PowerShell commands for Windows:
Linux:
[+] https://docs.aws.amazon.com/mgn/latest/ug/preparing-environments.html#TCP-443
[+] https://docs.aws.amazon.com/drs/latest/userguide/Network-Requirements.html#TCP-443
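The DNS part of the network checks above can be sketched by resolving the S3 hostname from the staging subnet and labelling each returned address as private or public. This is a hedged sketch: the helper names are hypothetical, and the interpretation is an assumption rather than a rule. Private addresses typically suggest an interface endpoint with private DNS is answering, while public addresses suggest traffic depends on a gateway endpoint route, a NAT gateway or an internet gateway.

```python
import ipaddress
import socket


def resolve(host, port=443):
    """Return the sorted, de-duplicated IP addresses that host resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def describe(addresses):
    """Map each address to "private" or "public"."""
    return {addr: "private" if ipaddress.ip_address(addr).is_private else "public"
            for addr in addresses}
```

For example, `describe(resolve("s3.us-east-1.amazonaws.com"))` run on an instance in the staging subnet shows which kind of address the resolver hands back; the hostname is an assumed regional pattern and should be replaced with the endpoint for your Region.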
5. Replication initiation fails at step "Create staging disks"
Error: Data replication stalled
Failed to create staging disks
This error occurs if the service fails to create the EBS volumes for the staging environment. Check the following:
- Check the CloudTrail logs for any errors in the CreateVolume API call.
- Check whether the EBS service quota has been reached.
- Check in Replication Settings whether you are using the default Amazon EBS volume encryption key or a customer managed key (CMK). If a CMK is used, ensure that the IAM role has permissions for the KMS key.
6. Replication initiation fails at step "Attach staging disks"
Error: Data replication stalled
Failed to attach staging disks
This error may occur if the IAM identity does not have permissions for the selected KMS key, or if a policy restricts the AttachVolume API.
- Check the CloudTrail logs for any errors in the AttachVolume API call.
- Check in Replication Settings whether you are using the default Amazon EBS volume encryption key or a customer managed key (CMK).
- Check the KMS key policy for any statements that may prevent MGN/DRS from using the selected KMS key. Verify this by checking the encryption key and its permissions under Key Users or Key Administrators.
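The key policy review above can be sketched as a scan that lists every statement naming a given role, so Allow and Deny entries can be read side by side. This is a minimal sketch under stated assumptions: the key policy JSON has already been retrieved and parsed, the role name fragment and the sample ARN are illustrative, and access granted through IAM policies or KMS grants is not covered. The function names are hypothetical.

```python
def _as_list(value):
    """Wrap a single value in a list; leave lists unchanged."""
    return value if isinstance(value, list) else [value]


def principals(stmt):
    """Flatten a statement's Principal element into a list of strings."""
    principal = stmt.get("Principal", {})
    if isinstance(principal, str):       # "*" means any principal
        return [principal]
    found = []
    for values in principal.values():    # keys such as "AWS" or "Service"
        found.extend(_as_list(values))
    return found


def statements_for(key_policy, role_fragment):
    """Return (Sid, Effect, actions) for statements that name the role."""
    result = []
    for stmt in _as_list(key_policy.get("Statement", [])):
        if any(p == "*" or role_fragment in p for p in principals(stmt)):
            result.append((stmt.get("Sid", ""), stmt["Effect"],
                           _as_list(stmt.get("Action", []))))
    return result


# Illustrative key policy; the account ID and role ARN are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "EnableRoot", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
         "Action": "kms:*", "Resource": "*"},
        {"Sid": "AllowServiceRole", "Effect": "Allow",
         "Principal": {"AWS": ["arn:aws:iam::111122223333:role/aws-service-role/"
                               "mgn.amazonaws.com/AWSServiceRoleForApplicationMigrationService"]},
         "Action": ["kms:CreateGrant", "kms:DescribeKey"], "Resource": "*"},
    ],
}

for sid, effect, actions in statements_for(policy, "AWSServiceRoleForApplicationMigrationService"):
    print(sid, effect, actions)
```

An empty result does not prove the role is blocked, because a root-account statement can delegate access to IAM policies; it only shows that the key policy itself does not name the role.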
7. Replication initiation fails at step "Pair replication server with Agent"
Error: Data replication stalled
Failed to pair replication server with replication agent
- Check that the source server can reach the replication server over TCP port 1500. Verify the security groups, NACLs, and route tables.
For Windows:
- tnc <replication server IP> -port 1500
- netstat -ano | findstr "1500"
For Linux:
- netcat -vz <replication server IP> 1500
- telnet <replication server IP> 1500
- netstat -an | grep "1500"
- Check the replication agent logs:
Linux: /var/lib/aws-replication-agent/agent.log.0
Windows: C:\Program Files (x86)\AWS Replication Agent\agent.log.0
[+] https://docs.aws.amazon.com/mgn/latest/ug/preparing-environments.html#Communication-TCP-1500
[+] https://docs.aws.amazon.com/drs/latest/userguide/Network-Requirements.html#Communication-TCP-1500
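Reviewing the replication agent log can be sketched as a keyword filter that keeps only the most recent matching lines. This is a rough sketch: the keyword list is an assumption chosen for this pairing step, nothing is assumed about the log's line format, and the function name `recent_matches` is hypothetical.

```python
from collections import deque


def recent_matches(path, keywords=("error", "fail", "1500"), limit=20):
    """Return the last `limit` lines of a log file that contain any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    matches = deque(maxlen=limit)   # older matches fall off as newer ones arrive
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            text = line.lower()
            if any(keyword in text for keyword in lowered):
                matches.append(line.rstrip("\n"))
    return list(matches)
```

On a Linux source server, `recent_matches("/var/lib/aws-replication-agent/agent.log.0")` returns the newest lines that mention an error, a failure or port 1500; on Windows, pass the log path listed above as a raw string.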