AWS Elastic Disaster Recovery and performing custom failback with DRS Mass Failback Automation Client (DRSFA Client)
This is an example of how the AWS DRS Mass Failback Automation Client (DRSFA Client) can be used for on-prem failback with additional customization.
AWS Elastic Disaster Recovery (AWS DRS) gives customers reliable, cost-effective disaster protection for cloud or on-prem workloads. AWS DRS is an agent-based replication service that protects and recovers your VMs and physical servers to AWS as EC2 instances in the event of a disaster. Most customers are required to test their DR capability at least once per year. A DR test may require workloads to run out of the DR site, such as AWS, for an extended period of time; the test could last multiple days, or even weeks, at full production capacity. Once the tests are complete, customers also need the capability to fail back to their on-prem environment to continue with normal operations.
AWS DRS has features that help customers perform a failback. When the failback targets are workloads outside of AWS, such as on-prem workloads running on vSphere, Hyper-V, or KVM, or even other clouds, customers can use the AWS DRS Failback Client ISO to reverse replication from EC2 back to the original environment. The Failback user guide can be found here. It is a straightforward process with easy-to-follow directions when failing back individual VMs or a small set of workloads.
When customers need to fail back at scale, AWS DRS offers the DRS Mass Failback Automation Client (DRSFA Client), which helps fail back many VMs at the same time. The default operation of the failback client is to fail back every workload that has a corresponding recovery instance launched. A common use case is failing back after a disaster recovery event to a customer's production environment, which most likely runs VMware vSphere infrastructure. Recently, a VMware customer asked me: What if I need to fail back just a few VMs? What would be the procedure? Failing back VMs one by one with the Failback Client ISO seems operationally inconvenient – can the DRSFA client help here (it has 'Automation' in its name – sounds promising!)?
AWS is always listening to customers for ways to provide more capabilities, and DRS Mass Failback Automation (DRSFA) includes an option to perform custom failbacks, such as failing back only a subset of the launched recovery instances. It can fail back to the original on-prem VMs (which were used as the replication source), or to newly created VMs if the originals are gone.
First, I strongly recommend taking a look at the official AWS DRS documentation and a great blog post to get a better understanding of AWS DRS and its requirements.
In this article I will review two failback scenarios: in the first, I will fail back to the original on-prem virtual machine; in the second, I will perform a recovery to a new VM, which simulates a situation where the original VM no longer exists in the production site.
As part of this testing, I completed the prerequisite of installing the DRSFA client in my lab. The DRSFA installation guide can be found here.
Scenario 1: Failback to the original existing on-prem virtual machine.
Step 1: We need to create a default configuration for DRSFA by following the client's prompts:
Welcome to the DRS Mass Failback Automation CLI
What would you like to do?
1. One-Click Failback
2. Perform a Custom Failback
3. Generate a default failback configuration file
4. Find servers in vCenter
5. Help
6. Exit
Enter a number between 1-6: 3
Enter a custom prefix for the configuration file name: default-failback
Default failback configuration file was created at: /home/apylnev/drs_failback_automation_client/Configurations/default-failback_us-east-2.json
Step 2: Let's review the configuration file and break down its different parts and their functions. The file contains the 2 on-prem VMs protected by AWS DRS; each VM has its own configuration {} block.
Section one: the networking configuration. Here you need the IP address of the VM that reverse replication will target; the corresponding recovery instance connects to that IP address (in contrast to normal replication, where an AWS DRS replication server handles the replication from the protected resources to AWS). In my lab I use DHCP.
Section two: 'RECOVERY_INSTANCE_ID' is already populated for both VMs, which means the recovery instances for those 2 VMs have already been launched. Recovery instances must be launched before a failback can be performed. Another important aspect of 'RECOVERY_INSTANCE_ID' is that every cycle of 'launch recovery instance/terminate recovery instance' produces a different, unique ID. In practice this means you cannot prepare this configuration file in advance and store it for future use; you need to update 'RECOVERY_INSTANCE_ID' each time a new recovery instance is launched.
Section three: 'DEVICE_MAPPING' allows you to define the exact disks that need to be replicated back on-prem:
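To make the structure easier to picture, here is a minimal sketch of what the generated file can look like for two VMs. Only 'RECOVERY_INSTANCE_ID', 'VCENTER_ORIGINAL_SOURCE_SERVER_UUID', 'VCENTER_TARGET_SERVER_UUID' and 'DEVICE_MAPPING' are fields discussed in this post; the networking field name, the top-level layout and every value below are placeholders, so always start from the file the DRSFA client generates in your environment:
[
    {
        "RECOVERY_INSTANCE_ID": "i-0aaaaaaaaaaaaaaa1",
        "VCENTER_ORIGINAL_SOURCE_SERVER_UUID": "00000000-0000-0000-0000-000000000001",
        "VCENTER_TARGET_SERVER_UUID": "00000000-0000-0000-0000-000000000001",
        "DEVICE_MAPPING": "Automatic",
        "IP_ADDRESS": "DHCP"
    },
    {
        "RECOVERY_INSTANCE_ID": "i-0bbbbbbbbbbbbbbb2",
        "VCENTER_ORIGINAL_SOURCE_SERVER_UUID": "00000000-0000-0000-0000-000000000002",
        "VCENTER_TARGET_SERVER_UUID": "00000000-0000-0000-0000-000000000002",
        "DEVICE_MAPPING": "Automatic",
        "IP_ADDRESS": "DHCP"
    }
]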
There are no friendly names for the protected resources in this file – no hostnames or VM names, only IDs to work with. In scenario 1, I have two recovery instances launched in my lab and I'd like to fail back only one of them; virtual machine 'apylnev-web-d-1' is my failback target. Using the AWS DRS console, I can identify it in the DRSFA config file:
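If you prefer the command line, the same recovery instances that the console shows (together with the source servers they belong to) can also be listed with the AWS CLI; this is just a convenience sketch, and the fields of interest remain easiest to read in the console:
aws drs describe-recovery-instances --region us-east-2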
The recovery instance ID and the source server UUID highlighted in the console match the 'RECOVERY_INSTANCE_ID' and 'VCENTER_ORIGINAL_SOURCE_SERVER_UUID' values in my config file. Now I need to create a custom config file for the VM I need to fail back – in this case I'm going to copy the default config file and remove the second block, which gives me the following:
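Here is a sketch of the resulting single-block file; the recovery instance ID is the one launched for 'apylnev-web-d-1' in my lab, while the UUIDs and the networking field are placeholders:
[
    {
        "RECOVERY_INSTANCE_ID": "i-0a14c2f006fe1151a",
        "VCENTER_ORIGINAL_SOURCE_SERVER_UUID": "00000000-0000-0000-0000-000000000001",
        "VCENTER_TARGET_SERVER_UUID": "00000000-0000-0000-0000-000000000001",
        "DEVICE_MAPPING": "Automatic",
        "IP_ADDRESS": "DHCP"
    }
]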
The idea of a custom failback is simple – keep blocks only for the VMs you need to fail back (and keep the JSON valid).
Step 3: Now my custom configuration is ready for launch:
Welcome to the DRS Mass Failback Automation CLI
What would you like to do?
1. One-Click Failback
2. Perform a Custom Failback
3. Generate a default failback configuration file
4. Find servers in vCenter
5. Help
6. Exit
Enter a number between 1-6: 2
Select an option from the list below:
1. Use a configuration file from a custom path
2. My configuration file is under /home/apylnev/drs_failback_automation_client/Configurations/
Enter a number between 1-2: 2
Select a custom configuration file to use:
1. apylnev-web-d-1.json
2. default-failback.json
Enter a number between 1-2: 1
Enter a custom prefix for the results output: custom-failback
The following Recovery instances will be failed back to their original VMs:
i-0a14c2f006fe1151a
Would you like to continue? (Y/N): Y
Initiating failback for account in region us-east-2
17:00:04: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
17:01:06: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
17:02:07: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
17:03:09: 1 total servers. 1 currently replicating, 0 initiating replication, 0 skipped, 0 failed
Results exported to /home/apylnev/drs_failback_automation_client/Results/Failback/custom-failback_us-east-2_2024-10-25 17:03:09.061178
In this log I can see that failback replication has started for my single instance. Now I can go back to the AWS DRS console to check my failback job:
When the reverse replication is complete, I can click on "Complete Failback" to finalize the process:
Scenario 2: What if we want to fail back to a brand new VM (i.e., I don't want to overwrite the original VM), or the original VM is gone and we can't restore it on-prem from backup? How can DRSFA handle these situations?
Step 1: Let's review the DRSFA configuration file I prepared in the first scenario. The client uses the "VCENTER_TARGET_SERVER_UUID" field to identify the VM it will fail back to. A brand new VM has a new UUID, which means I need to update my config file with that new UUID. The "VCENTER_TARGET_SERVER_UUID" field references 'VM.config.Uuid', and we can use PowerCLI to retrieve it:
# Retrieve the original VM from vCenter; its Config object includes the Uuid property
$VM = Get-View -ViewType VirtualMachine -Filter @{"Name"="apylnev-web-d-1"}
$VM.Config
The same command pattern gives me the UUID of the new VM, which I can then use to replace the old UUID in my config file. One thing to note here: the disks attached to the new VM must be the same size as, or larger than, the original VM's disks – otherwise the replication will fail.
# Retrieve the replacement VM and read its vCenter UUID for the config file
$VM = Get-View -ViewType VirtualMachine -Filter @{"Name"="apylnev-web-d-failback"}
$VM.Config.Uuid
Step 2: When performing a failback, you can select the disks that will be replicated using the "DEVICE_MAPPING" field in the config file. In my first scenario this field was set to 'Automatic', which works well when failing back to the original VM. However, failing back to a new VM requires you to explicitly map the drives from the recovery instance (an EC2 instance in this case) to the new VM's disk layout. Here is a detailed document on how to do it. An important thing to note is that you cannot assume the disk names will match your original VM's disk names. In the next listing, I'm using 'ls -l /dev/disk/by-path/' to check the disk names on my recovery instance:
ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx 1 root root 13 Oct 28 15:55 pci-0000:00:04.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Oct 28 15:55 pci-0000:00:04.0-nvme-1-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Oct 28 15:55 pci-0000:00:04.0-nvme-1-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 15 Oct 28 15:55 pci-0000:00:04.0-nvme-1-part3 -> ../../nvme0n1p3
My recovery instance has a single disk named 'nvme0n1' with 3 partitions, so I can use '/dev/nvme0n1' in the DEVICE_MAPPING parameter of my configuration file. Here's what my new config file looks like, with the new UUID and disk mapping:
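Here is a sketch of the updated single-block file. The recovery instance ID is the one launched for this test, the target UUID stands for the value returned by PowerCLI for the new VM (shown as a placeholder below), and the exact syntax DEVICE_MAPPING expects for explicit mappings is described in the documentation linked above:
[
    {
        "RECOVERY_INSTANCE_ID": "i-0732b0db01349b724",
        "VCENTER_ORIGINAL_SOURCE_SERVER_UUID": "00000000-0000-0000-0000-000000000001",
        "VCENTER_TARGET_SERVER_UUID": "00000000-0000-0000-0000-000000000002",
        "DEVICE_MAPPING": "/dev/nvme0n1",
        "IP_ADDRESS": "DHCP"
    }
]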
Step 3: Now I can perform a custom failback to my new VM.
Welcome to the DRS Mass Failback Automation CLI
What would you like to do?
1. One-Click Failback
2. Perform a Custom Failback
3. Generate a default failback configuration file
4. Find servers in vCenter
5. Help
6. Exit
Enter a number between 1-6: 2
Select an option from the list below:
1. Use a configuration file from a custom path
2. My configuration file is under /home/apylnev/drs_failback_automation_client/Configurations/
Enter a number between 1-2: 2
Select a custom configuration file to use:
1. default-fb_us-east-2.json
2. custom-fb.json
Enter a number between 1-2: 2
Enter a custom prefix for the results output: custom-new
The following Recovery instances will be failed back to their original VMs:
i-0732b0db01349b724
Would you like to continue? (Y/N): Y
Initiating failback for account in region us-east-2
16:02:57: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
16:03:59: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
16:05:00: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
16:06:01: 1 total servers. 0 currently replicating, 1 initiating replication, 0 skipped, 0 failed
16:07:02: 1 total servers. 1 currently replicating, 0 initiating replication, 0 skipped, 0 failed
Results exported to /home/apylnev/drs_failback_automation_client/Results/Failback/custom-new_us-east-2_2024-10-28 16:07:02.722363
Wait until the replication has completed and then click on 'Complete failback' to finalize the process.
Conclusion
I hope this blog post helps you better understand the capabilities of the AWS DRSFA client and how it can be used in different scenarios. DRSFA can be used to fail back a group of VMs, either to the original VMs or to newly created ones; you simply need to identify the required workloads (VMs) and gather the required parameters for your config file. This is something that could even be scripted/automated with various tools for DR drills/tests, as the sketch below shows.
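As a minimal example of that idea – assuming the configuration file is a JSON array of per-VM blocks, as in the sketches above, and that jq is installed – the following one-liner carves a custom config out of the generated default file by keeping only the recovery instances you want to fail back:
# Keep only the block(s) whose RECOVERY_INSTANCE_ID matches the instance to fail back
jq '[ .[] | select(.RECOVERY_INSTANCE_ID == "i-0a14c2f006fe1151a") ]' \
    default-failback_us-east-2.json > apylnev-web-d-1.json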