In AWS Elastic Disaster Recovery - how to fix "Start reversed replication" failure "Verify the agent is installed and running"?


We performed a failover from DC1 to DC2, and everything went OK.

While re-establishing replication from DC2 back to DC1 using "Start reversed replication", one server gives this error: i-03a15f3c315f0f18c: AWS Replication agent is not connected to DRS. Verify the agent is installed and running, and that it has connectivity to the service

We followed the same procedure on the other instances (reinstalled the agent using the download from the target region), and running "Start reversed replication" worked for them.

For this one instance, the AWS EDR console in the target environment showed syncing, then snapshotting, and now lists it under Target Environment: Source Servers

while the source environment lists it under Source Environment: Recovery Instance

I have tried reinstalling the agent, which results in the same sequence of steps (sync, snapshot).

I am confused because the error says "not connected", yet a connection has clearly been made for the instance to be registered on the target side and synced.

DaveH
asked 4 months ago, 197 views
3 Answers
Accepted Answer

I made a mistake in this step of my original procedure:

We have followed the same procedure on other instances - reinstall the agent using download from the target region - and running "Start reversed replication" has worked.

We performed a cleanup in AWS EDR (disconnected and removed all recovery and source instances) and started from scratch, with a full sync of the source servers followed by a recovery.

This time I did not reinstall the agent using the download from the target region (that was me not understanding the technology). Simply running "Start reversed replication" successfully re-established the "Source servers" in the primary region.

DaveH
answered 4 months ago
EXPERT
reviewed 15 days ago

In certain cases, following an attempt to perform a reverse replication action, you will receive an error message indicating that the AWS Replication agent is not connected to AWS Elastic Disaster Recovery. In this case, verify that:

  1. The agent is installed and running
  2. The server is connected to the internet or the NAT gateway

In addition, the recovery instance must allow inbound connections on TCP port 1500 from the source environment, either over VPN/DX or via a public IP.
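
As a rough sketch of the port 1500 check, something like the following could be run from a host in the source environment. It assumes bash and the coreutils `timeout` command are available. `check_tcp_port` and `RECOVERY_INSTANCE_IP` are hypothetical names introduced here for illustration, and `X.X.X.X` is a placeholder for the recovery instance address. It uses bash's `/dev/tcp` pseudo-device instead of `nc`, purely so that no extra tool is needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if a TCP connection to host:port
# opens within 5 seconds, and fails otherwise.
check_tcp_port() {
    local host="$1" port="$2"
    # bash treats /dev/tcp/HOST/PORT as "open a TCP connection".
    # Host and port are passed as arguments ($0 and $1 of the inner shell).
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Placeholder address; replace with the recovery instance's private or public IP.
RECOVERY_INSTANCE_IP="${RECOVERY_INSTANCE_IP:-X.X.X.X}"

if check_tcp_port "$RECOVERY_INSTANCE_IP" 1500; then
    echo "port 1500 reachable"
else
    echo "port 1500 NOT reachable"
fi
```

A failure here only shows that the TCP handshake did not complete from this host. The cause could be a security group, a network ACL, routing over the VPN/DX link, or nothing listening on the port.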

Please also check the following:

  • Check the routes to metadata connectivity on the Recovery Instance.
  • Restart the AWS Replication service.
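
One possible way to check the metadata route is sketched below. It assumes that "metadata" here means the EC2 instance metadata service (IMDS) at the link-local address 169.254.169.254; that is an interpretation, not something this answer states. It also assumes a Linux recovery instance with `curl` and the iproute2 `ip` command. `imds_reachable` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumption: "metadata" refers to the EC2 instance metadata service (IMDS).
IMDS_ADDR="169.254.169.254"

# Hypothetical helper: succeeds if an IMDSv2 session token can be requested.
imds_reachable() {
    curl --silent --fail --max-time 3 --output /dev/null \
        --request PUT "http://${IMDS_ADDR}/latest/api/token" \
        --header "X-aws-ec2-metadata-token-ttl-seconds: 60"
}

# Show which route the kernel would pick for the metadata address.
ip route get "$IMDS_ADDR" 2>/dev/null \
    || echo "no route to ${IMDS_ADDR} (or 'ip' is not installed)"

if imds_reachable; then
    echo "instance metadata service reachable"
else
    echo "instance metadata service NOT reachable"
fi
```

If the token request fails while other outbound traffic works, a custom route or host firewall rule covering the link-local address is one place to look.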

If the issue persists, please open a support case and include the latest agent log from the recovery instance, along with the following:

  1. DRS Source ID (starting with s-xx) associated with the recovery instance.

  2. DRS agent logs located at (/var/lib/aws-replication-agent/agent.log.0)

  3. DR AWS Region

  4. The type of DRS setup in use: AWS to AWS (cross-Availability Zone or cross-Region) or on-premises to AWS

  5. Confirmation that all the agent processes are running on the recovery instance: $ ps aux | grep aws-replication
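
The process check and log collection from the list above could be scripted roughly as follows. This is a minimal sketch to run on the recovery instance, using the log path given in item 2. `has_agent_process` is a hypothetical helper, and the 200-line tail is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Log path as given in the list above.
AGENT_LOG="/var/lib/aws-replication-agent/agent.log.0"

# Hypothetical helper: reads a process listing on stdin and succeeds
# if any line mentions the replication agent.
has_agent_process() {
    grep -q 'aws-replication'
}

# Drop grep's own entries so they are not counted as agent processes.
if ps aux | grep -v 'grep' | has_agent_process; then
    echo "agent processes found"
else
    echo "no agent processes found"
fi

# Print the most recent log lines so they can be attached to the support case.
if [ -r "$AGENT_LOG" ]; then
    tail -n 200 "$AGENT_LOG"
else
    echo "cannot read ${AGENT_LOG} (it may require sudo, or the agent is not installed)"
fi
```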

I can confirm that the recovery instance has "AWSElasticDisasterRecoveryRecoveryInstancePolicy" attached to the role "AWSElasticDisasterRecoveryRecoveryInstanceRole".

Hence, additional details are required to troubleshoot the current issue further.

Please refer:

[1] https://docs.aws.amazon.com/drs/latest/userguide/Troubleshooting-Failback-Errors.html
[2] https://docs.aws.amazon.com/drs/latest/userguide/Network-Requirements.html

AWS
SUPPORT ENGINEER
Jeff_B
answered 4 months ago
EXPERT
reviewed 15 days ago
EXPERT
reviewed 4 months ago

Hi, thanks for the info/suggestions.

When I checked this morning, the agent was not running (according to "ps aux | grep aws-replication"), but I believe it must have run originally, as the instance appears under "Source Servers" in the original primary region.

I have restarted the agent ("sudo /var/lib/aws-replication-agent/runAgent.sh") and now the instance under "Source Servers" shows "Rescanning".

Could you explain how to check these:

  • Check the routes to metadata connectivity on the Recovery Instance
    - I have successfully run nc -zv X.X.X.X 1500, but do you mean something else?

  • Restart the AWS Replication service - does this mean the agent on the instance, or some other region-wide service?

Edit:

I am wondering if I misstepped by reinstalling the agents once the instances had been recovered over to the second region. Should I have just run "Start reversed replication"?

DaveH
answered 4 months ago
