DR/DC strategy for AWS EC2 instances


Hi AWS, an MQTT broker service is running on an EC2 Ubuntu instance in active-active mode. How do I develop a disaster recovery strategy for the scenario where the primary instance goes down, a new instance is created in its place, and traffic is re-routed? Note that it can afford from 5-10 minutes of downtime.

Can you also help me with the RPO and RTO for this? Also, how can I leverage monitoring to get a more detailed view during a failover?

How can I carry the data, up to the last update made on the failed instance, over to the newly created instance?

Please suggest

  • Are the EC2 instances behind a load balancer?

  • Yes, the EC2 instances are behind a load balancer.

Arjun
asked a year ago, 1740 views
3 Answers

Before getting into details, you should consider using AWS IoT Core, which includes a managed MQTT message broker. It already offers what you are looking for, so you would not need to worry about the aspects you describe. Depending on your message volumes (which you do not mention), you may even fit within the AWS Free Tier for IoT Core.

For high availability over MQTT in an active-active setup, I would suggest that you put a Network Load Balancer (NLB) in front of your MQTT cluster. The NLB can perform health checks against your EC2 instances. You can set up the NLB to listen on 1883 (MQTT) and 8883 (MQTT over TLS). The NLB can also terminate TLS on 8883 so that your back ends do not have to.

Then, for high availability, simply place your EC2 instances into an Auto Scaling group, which will replace any failed ones. The NLB itself will stop sending traffic to an instance that fails its health checks, while Auto Scaling replaces it.
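To make the health-check behaviour concrete, here is a minimal sketch of what a TCP health check does, assuming a target counts as unhealthy after a number of consecutive failed probes. The function names and the threshold of 3 are illustrative choices, not NLB defaults or an AWS API.

```python
import socket
from typing import Sequence


def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def is_unhealthy(results: Sequence[bool], unhealthy_threshold: int = 3) -> bool:
    """Treat the target as unhealthy once the last N probes have all failed.

    unhealthy_threshold is a hypothetical value chosen for illustration.
    """
    if len(results) < unhealthy_threshold:
        return False
    return not any(results[-unhealthy_threshold:])
```

The same probe can be run from your own monitoring against port 1883 or 8883 to log when a broker stopped answering, which gives a rough detection time of probe interval multiplied by the threshold.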

The only impact to your clients would be that existing MQTT TCP sessions to the failed node would need to be re-established - and the NLB will ensure that any re-establishment goes to a new healthy node.

As for RTO: in an HA setup you should not have any downtime, since a healthy broker instance is always available behind the load balancer. RPO is likewise determined by your architecture. The MQTT protocol has built-in Quality of Service (QoS) levels. If you do not want to lose any messages, consider QoS 1, or "at least once" delivery. This gives you acknowledged delivery: if the broker is down, your client knows to resend the message until it is acknowledged.
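The "resend until acknowledged" idea behind QoS 1 can be sketched with a small simulation. This is not an MQTT client: `FlakyBroker` and `publish_at_least_once` are hypothetical names, and the broker is an in-memory stand-in that fails to acknowledge the first few publishes, as it would while an instance is down.

```python
from typing import List, Tuple


class FlakyBroker:
    """Hypothetical in-memory broker that ignores the first `failures` publishes."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received: List[Tuple[int, str]] = []

    def publish(self, packet_id: int, payload: str) -> bool:
        """Return True if the message was stored and acknowledged, else False."""
        if self.failures > 0:
            self.failures -= 1
            return False  # no acknowledgement reached the client
        self.received.append((packet_id, payload))
        return True


def publish_at_least_once(broker: FlakyBroker, packet_id: int, payload: str,
                          max_attempts: int = 5) -> int:
    """Resend the same message until it is acknowledged; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if broker.publish(packet_id, payload):
            return attempt
    raise RuntimeError(f"no acknowledgement after {max_attempts} attempts")
```

In a real deployment an acknowledgement can also be lost after the broker has stored the message, so QoS 1 subscribers should tolerate duplicates.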

You don't provide information on which broker you are using - so it is difficult to get more specific on HA implementation on the broker.

AWS
EXPERT
answered a year ago
  • @Max Clements sorry, I didn't understand what you meant by "which broker you are using". We have installed MQTT clients on the EC2 instance, and right now there is only one instance, which is what the DR strategy is for. Can you please also suggest how to improve this architecture? Do I need to launch one more EC2 instance that works in passive mode, or is there a better way to use the single instance to achieve an active-passive DR strategy? Please suggest.


For an active-passive disaster recovery strategy, you can use either the Pilot Light strategy or the Warm Standby strategy, depending on your cost constraints and how much operational effort you can accept.

Pilot Light Strategy

  • With the pilot light strategy, the data is live, but the services are idle. Live data means the data stores and databases are up-to-date (or nearly up-to-date) with the active Region and ready to service read operations.
  • In the pilot light strategy, basic infrastructure elements such as Elastic Load Balancing and Amazon EC2 Auto Scaling are in place, but the EC2 instances themselves are not running.
  • To “turn on” these instances, we use an Amazon Machine Image (AMI) that was previously built and copied to all Regions.
  • This AMI creates Amazon EC2 instances with exactly the operating system and packages we need.

Warm Standby

  • The warm standby strategy maintains live data in addition to periodic backups.
  • The difference between the two is infrastructure and the code that runs on it.
  • A warm standby maintains a minimum deployment that can handle requests, but at a reduced capacity—it cannot handle production-level traffic.
  • Before failover, the infrastructure must scale up to meet production needs.

Difference between them

The warm standby strategy deploys a functional stack, but at reduced capacity. The DR endpoint can handle requests, but cannot handle production levels of traffic.

The RTO for these strategies is different. Warm standby can handle traffic at reduced levels immediately, and you then scale out the existing deployment, which gives it a lower RTO than pilot light. Pilot light requires you to first deploy infrastructure and then scale out resources before the workload can handle any requests.
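The RTO difference can be illustrated with a small calculation. This is only a sketch of the reasoning above: the strategy names as strings, the function name, and the minute values in the usage note are all made up for illustration and should be replaced with timings you measure yourself.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTimes:
    first_request_minutes: float  # until the DR site can serve any traffic
    full_capacity_minutes: float  # until it can serve production-level traffic


def estimate_recovery(strategy: str, deploy_minutes: float,
                      scale_out_minutes: float) -> RecoveryTimes:
    """Estimate recovery times for the two strategies described above."""
    if strategy == "warm_standby":
        # A reduced-capacity stack is already running, so only scale-out remains.
        return RecoveryTimes(0.0, scale_out_minutes)
    if strategy == "pilot_light":
        # Instances must be deployed and scaled out before any request is served.
        total = deploy_minutes + scale_out_minutes
        return RecoveryTimes(total, total)
    raise ValueError(f"unknown strategy: {strategy}")
```

With assumed timings of 6 minutes to deploy and 4 minutes to scale out, pilot light would serve its first request after 10 minutes, while warm standby would serve reduced traffic immediately and reach full capacity after 4 minutes.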

For more information, refer to these blog posts:

  • https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
  • https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iii-pilot-light-and-warm-standby/

answered a year ago

Can you also help me with the RPO and RTO for this?

You've answered your own question about the RTO:

it can afford from 5-10 minutes of downtime.

If the business can tolerate the service being unavailable for ten minutes, then the RTO is ten minutes.

RPO is also specified by the business. It is expressed as the number of minutes or hours' worth of data whose loss could be tolerated.

Once the business provides you with this value you can design your solution to meet it.
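Once you have both targets, checking a design against them is simple arithmetic. The sketch below assumes that recovery is a sequence of steps whose durations add up, and that the worst-case data loss is roughly one backup or replication interval. The step names and numbers in the usage note are hypothetical.

```python
from typing import Dict


def total_recovery_minutes(steps: Dict[str, float]) -> float:
    """Sum the duration of every recovery step, in minutes."""
    return sum(steps.values())


def meets_rto(steps: Dict[str, float], rto_minutes: float) -> bool:
    """True if the whole recovery sequence fits inside the RTO."""
    return total_recovery_minutes(steps) <= rto_minutes


def meets_rpo(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    """True if the worst-case data loss fits inside the RPO.

    Assumes the worst case is a failure just before the next backup,
    so up to one full interval of data is lost.
    """
    return backup_interval_minutes <= rpo_minutes
```

For example, with assumed steps of 1.5 minutes to detect the failure, 3 minutes to launch a replacement, 1 minute to start the broker, and 1 minute to pass health checks, the total is 6.5 minutes, which fits a 10-minute RTO. A 15-minute snapshot interval would not meet a 5-minute RPO.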

EXPERT
Steve_M
answered a year ago
