Optimizing AWS DRS Failback Times with High Performance EBS Volumes

4 minute read
Content level: Advanced

This article covers the use case where customers use AWS Elastic Disaster Recovery (AWS DRS) to set up disaster recovery. It explains how using high performance EBS volumes can reduce the time taken for a failback operation after a failover.

Introduction

Customers who move to AWS Elastic Disaster Recovery (AWS DRS) from on-premises disaster recovery solutions such as VMware Site Recovery Manager often find that failback takes longer than it did with their previous solution. For example, during a proof of concept or a drill, the customer fails over from the on-premises server to the DRS recovery instance and later fails back. The time required to complete the failback depends on several factors. Apart from the network path, the available throughput, and the write activity on the failed-over server, an important factor to consider is the type of storage volume (EBS) attached to the recovery instance.

Problem Statement

As an example, during a proof of concept (POC) performed by a customer, failback of an on-premises server with a 1.5 TB storage volume took around 28 hours. The setup included a General Purpose SSD (gp3) volume on the recovery instance, PrivateLink, a verified network path, and a 1 Gbps Direct Connect connection with a measured throughput of 800 Mbps.
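To put these figures in perspective, the short calculation below compares the observed duration with what the network alone would allow. It is a rough sketch that assumes decimal units (1 TB = 10^12 bytes) and that the full 1.5 TB is read and sent without compression; the function names are illustrative only.

```python
def hours_to_transfer(size_bytes, link_bits_per_second):
    """Hours needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second / 3600


def implied_rate_mb_per_s(size_bytes, elapsed_hours):
    """Average rate in MB/s implied by moving size_bytes in elapsed_hours."""
    return size_bytes / (elapsed_hours * 3600) / 1e6


VOLUME_BYTES = 1.5e12   # 1.5 TB volume from the POC (decimal TB assumed)
MEASURED_BPS = 800e6    # measured throughput of 800 Mbps
FAILBACK_HOURS = 28     # observed failback duration

network_floor_hours = hours_to_transfer(VOLUME_BYTES, MEASURED_BPS)
implied_rate = implied_rate_mb_per_s(VOLUME_BYTES, FAILBACK_HOURS)

print(f"Network-only lower bound: {network_floor_hours:.1f} hours")
print(f"Average rate implied by 28 hours: {implied_rate:.1f} MB/s")
```

Under these assumptions the network alone would need roughly 4.2 hours, while a 28-hour failback corresponds to an average of about 14.9 MB/s, well below the 100 MB/s that an 800 Mbps link can carry. This suggests the network was not the limiting factor.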

End users are not impacted during the failback period, because the failed-over server remains fully operational. Even so, the duration can be an issue for customers who have established testing procedures for their DR drills and failback. For example, end users may be waiting to test applications once failback is complete, and there may be specific timing requirements for completing failback during an actual disaster recovery event. Customers should take these factors into account when planning failback, align the process with established testing plans and timing requirements, and communicate the failback schedule and procedures clearly to end users to avoid unexpected impacts.

Solution

One reason for a long failback can be the type of EBS volume attached to the recovery instance. During failback, the recovery instance storage is synchronized with the source server, which requires reading the entire volume. Because the volumes attached to the recovery instance are created from a snapshot, this scan can be slow while the volumes are hydrated. That is, storage blocks must be copied from Amazon S3 and written to the volume before they can be accessed. This preliminary action takes time and can significantly increase the latency of I/O operations the first time each block is accessed. Peak volume performance is reached only after all blocks have been downloaded and written to the volume.

Using higher performance EBS volume types such as io1 or io2 on the recovery instance can significantly mitigate the hydration penalty and accelerate failback, because these types offer lower latency and higher IOPS than gp2 or gp3 volumes. In one example, simply switching the recovery instance's volume from gp3 to io1 reduced the failback time from 28 hours to just 3 hours. Volume types are priced differently, so cost should be taken into account when choosing one.

For newly added source servers, the volume type can be changed by editing the default EC2 launch template, as described at: https://docs.aws.amazon.com/drs/latest/userguide/default-ec2-launch-template.html

To change the volume type for an individual source server, edit the EC2 launch template for that server: https://docs.aws.amazon.com/drs/latest/userguide/ec2-launch.html
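For teams that prefer to script this change, the sketch below shows one possible approach using the boto3 EC2 client: read the default version of a launch template, switch every EBS mapping to a Provisioned IOPS type, and save the result as the new default version. It is an illustration rather than the documented DRS procedure. The helper names, the template ID, and the IOPS value are hypothetical inputs, and the sketch assumes that the version marked as default is the one used at launch.

```python
import copy


def with_volume_type(mappings, volume_type, iops):
    """Return a copy of launch-template block device mappings with every
    EBS volume switched to volume_type and provisioned with iops."""
    updated = copy.deepcopy(mappings)
    for mapping in updated:
        ebs = mapping.get("Ebs")
        if ebs is None:
            continue  # leave non-EBS entries untouched
        ebs["VolumeType"] = volume_type
        ebs["Iops"] = iops
        # Throughput is a gp3-only setting, so drop it for other types.
        if volume_type != "gp3":
            ebs.pop("Throughput", None)
    return updated


def apply_to_launch_template(template_id, volume_type, iops):
    """Create a new default template version that uses the given volume type.

    template_id is a hypothetical input: the EC2 launch template tied to
    the source server. Requires boto3 and suitable AWS credentials.
    """
    import boto3  # imported lazily so the helper above has no dependencies

    ec2 = boto3.client("ec2")
    current = ec2.describe_launch_template_versions(
        LaunchTemplateId=template_id, Versions=["$Default"]
    )["LaunchTemplateVersions"][0]
    mappings = current["LaunchTemplateData"].get("BlockDeviceMappings", [])

    created = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion=str(current["VersionNumber"]),
        LaunchTemplateData={
            "BlockDeviceMappings": with_volume_type(mappings, volume_type, iops)
        },
    )
    new_version = created["LaunchTemplateVersion"]["VersionNumber"]

    # Mark the new version as the default for this template.
    ec2.modify_launch_template(
        LaunchTemplateId=template_id, DefaultVersion=str(new_version)
    )
    return new_version
```

A call such as apply_to_launch_template(template_id, "io1", iops) would then update the template, where the IOPS value must be one that is valid for the chosen volume type and size. Keeping the mapping change in a separate pure function makes it easy to review the resulting configuration before any API call is made.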

Solution without changing EBS volume type

Alternatively, initializing the recovery instance volumes after creation from the snapshot can avoid the hydration delay during failback synchronization. However, the initialization process itself adds time overhead that should be considered.

Volumes on both Windows and Linux instances can be initialized by following the procedures described at: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initialize.html
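The idea behind initialization is simply to read every block of the volume once, so that each block is fetched from Amazon S3 before the failback scan needs it. The linked procedures use standard operating system tools for this. The sketch below expresses the same sequential read in Python purely to illustrate the mechanism; the function name and the device path mentioned afterwards are hypothetical, and reading a raw block device normally requires root privileges.

```python
def initialize_volume(device_path, chunk_size=1024 * 1024):
    """Read device_path from start to end once and return the bytes read.

    The data itself is discarded; the point is that every block gets
    touched, which forces it to be loaded onto the volume.
    """
    total = 0
    # buffering=0 avoids an extra Python-level buffer on top of the OS.
    with open(device_path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:  # an empty read means end of device
                break
            total += len(chunk)
    return total
```

On a Linux recovery instance this might be invoked as initialize_volume("/dev/xvdf"), where the device name depends on how the volume is attached. A single sequential pass is the simplest form; the time it takes grows with volume size, which is the overhead mentioned above.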

AWS EXPERT
published 2 months ago