Following best practices in designing resilient applications – Part 1

15 minute read
Content level: Advanced

This article is the first part of a series on resilience best practices and key design principles that can minimize business disruptions during outages.

Introduction

To prevent business disruptions during outages, you must design your workloads with resilience in mind. If your architecture doesn't use best practices for cloud resilience, then your business can be stuck dealing with outages and losing money. AWS works closely with customers to assess their workload design, identify improvements, and provide guidance on how to implement resilient architectures. Cloud resilience isn't only a feature, but a strategic necessity.

This article explores the importance of resilience for businesses. It also shares insights from AWS Support that highlight how design principles, such as designing for failure and implementing loose coupling, can improve application robustness. Understanding resilience principles and using AWS services and expertise can help you create highly available and fault-tolerant workloads that tolerate failures and meet your organization's needs.

Importance of resilience

Resilience safeguards applications from disruptions so that applications remain functional and reliable. In the dynamic field of technology where failures are inevitable, resilient designs are critical for a seamless user experience.

Resilient applications provide the following benefits:

  • Business continuity

  • Customer satisfaction

  • Long-term success

When you incorporate resilience, you provide your applications with the capacity to recover from challenges and thrive in dynamic environments.

If your applications aren't resilient, then disruptions in strategic workloads can affect safety, customer experience, reputation, and revenue. For example, some Automotive Original Equipment Manufacturers (OEMs) run connected vehicle platforms that aren't designed for multi-site disaster recovery. During a regional outage, these OEMs can't offer critical connected vehicle capabilities, such as door unlock and SOS calls. Also, some of the vehicle data that automotive manufacturers share with their partners is time-sensitive. For example, fuel and odometer data are often shared with rental companies. If delivery of this data is delayed, then the delay can damage the reputation of the data service and the manufacturer's revenue.

Resilience also has a quantitative aspect. Imagine a scenario where your workloads aren't resilient and an outage occurs. Rather than assessing the potential business loss only after your customers experience downtime, it's essential to integrate resilience into applications proactively. This proactiveness helps make sure that you're prepared for outages with minimal impact to your operations.

It pays to be well-prepared

Achieving resilience for strategic workloads is a shared responsibility:

  • Resilience of the Cloud: AWS is responsible for making sure that the infrastructure that runs all the services offered in the AWS Cloud is resilient.

  • Resilience in the Cloud: Customer responsibility is determined by the AWS Cloud services that a customer selects. These selections determine the amount of architecture, design, and configuration work that the customer must perform as part of making their workload resilient.

With early preparation, you can establish a framework for resilience and operational excellence that aligns with your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) goals. RTO is a measure of how quickly your application can recover after an outage. RPO is a measure of the maximum amount of data loss that your application can tolerate. For example, an RTO of 30 minutes means that the application must be restored within 30 minutes of an outage, and an RPO of 5 minutes means that backups or replication must capture changes at least every 5 minutes.

You must consider various types of disruptions, such as application interruptions, the failure of a single infrastructure component that your workload depends on, and the impairment of an Availability Zone (AZ) or Region. It's vital to define how quickly recovery must occur after these failures and the acceptable level of data loss during these incidents.

To get the most from the AWS Shared Responsibility Model, follow these guidelines:

Know your workloads

To enhance your application's resilience, do the following actions:

  • Use the AWS Well-Architected Framework to evaluate and design your workload.

  • Document and regularly update information about your application's business and technical owners.

  • Identify and catalog all dependent resources.

  • Implement automation to keep information current.

  • Use AWS services, such as AWS Config and AWS Systems Manager, to track and manage your resources (see the sketch after this list).

  • If you're subscribed to AWS Enterprise Support or Enterprise On-Ramp Support plans, then contact your Technical Account Manager (TAM). Your TAM can provide additional guidance on how to assess your workload resiliency and implement best practices for robust architectures.
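
For illustration, the following Python (boto3) sketch uses AWS Config to take a quick inventory of a workload. It's a minimal sketch that assumes AWS Config is already recording resources in the account, that credentials are configured, and that the resource type shown is the one you want to catalog; adapt it to your own workload.

# A minimal inventory sketch, assuming AWS Config is already recording
# resources in this account and Region.
import boto3

config = boto3.client("config")

# Summarize what AWS Config has discovered, grouped by resource type.
counts = config.get_discovered_resource_counts()
for item in counts["resourceCounts"]:
    print(f'{item["resourceType"]}: {item["count"]}')

# List the EC2 instances that belong to the workload so they can be cataloged.
resources = config.list_discovered_resources(resourceType="AWS::EC2::Instance")
for resource in resources["resourceIdentifiers"]:
    print(resource["resourceId"], resource.get("resourceName", ""))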

Build resilient systems

Use the Well-Architected Framework and AWS Industry Solutions and patterns to build a resilient system. Because resiliency is expensive to incorporate as an afterthought, it's important to take these actions early in the design process.

Strive for operational excellence

To operate your system with minimal disruption, implement observability, operations-as-code, clear organizational goals, and risk management processes. This approach reduces the risk of costly downtime for your critical workloads.

Importance of setting resilience goals

To fortify applications against unforeseen challenges, it's important to set resilience goals. These goals provide a clear roadmap to design and implement resilient strategies. It's a best practice to align resilience strategies with your business goals:

  • Define comprehensive resilience metrics: Metrics include RTO, RPO, Mean Time to Repair (MTTR), Mean Time Between Failures (MTBF), and error budgets (a worked error budget example follows below).

  • Establish tiered Service Level Agreements (SLAs) and Service Level Objectives (SLOs): Create different tiers based on criticality of application components. Then, align these tiers with business impact and customer expectations.

  • Implement proactive resilience measures: Use chaos engineering practices to identify weaknesses before they cause outages. Then, conduct regular resilience drills to test and improve recovery processes.

  • Quantify the business impact of downtime: Calculate potential revenue loss, customer churn, and reputational damage. Then, use these figures to justify investments in resilience improvements.

  • Balance resilience with other business priorities: Consider cost-effectiveness of different resilience strategies. Then, evaluate the trade-offs between perfect uptime and speed of innovation.

With this approach, organizations can create a resilience strategy that meets technical requirements and aligns closely with overall business objectives and constraints.
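
As a worked example of these metrics, the following short Python calculation turns an availability objective into an error budget. The 99.9% target and 30-day window are illustrative assumptions, not recommendations.

# A minimal sketch of turning an availability SLO into an error budget.
SLO = 0.999                      # availability objective (assumed 99.9%)
WINDOW_MINUTES = 30 * 24 * 60    # assumed 30-day evaluation window

error_budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Allowed downtime per 30 days: {error_budget_minutes:.1f} minutes")
# Roughly 43.2 minutes. Every incident consumes part of this budget, so the
# RTO targets for each tier should fit comfortably inside it.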

Resilience goals not only enhance preparedness but also serve as a benchmark to measure the effectiveness of resilience strategies over time. These goals provide a proactive and systematic approach to building applications that can withstand disruptions and contribute to sustained operational excellence.

To be well-prepared, you have to understand the significance of resilience and establish clear goals, such as RTO and RPO. Some important principles to consider when you design your cloud architecture are:

  • Design for failure, and nothing fails

  • Build security in every layer

  • Use many storage options

  • Reduce dependency on the control plane

  • Implement static stability

  • Loose coupling sets you free

  • Don't fear constraints

A simple web-app architecture is used for the examples in this article. The architecture involves a front end hosted on Amazon Elastic Compute Cloud (Amazon EC2) that uses a database for all storage and lookups. The architecture also uses Amazon Route 53 for DNS and a single Elastic IP address. The principles apply to other cloud architectures and aren't limited to the examples provided.

Principle 1: Design for failure, and nothing fails

To design for failure, let's start at the beginning of the design process with a single user. In this example, a typical full stack architecture with a single Amazon EC2 host uses the following AWS services:

  • Route 53 for DNS

  • A single Elastic IP address

  • A single EC2 instance

With a full stack deployed on a single host, you likely have the following components:

  • A web app

  • A database with a processing logic component

The following diagram shows a basic architecture configuration that's used to host a web application:

[Diagram: A single EC2 instance that runs both the web application and the database, with Route 53 for DNS and an Elastic IP address attached]

When a user initiates a request, Route 53 handles DNS resolution. Behind this service, an EC2 instance runs both the web application and the database on a single server. To direct traffic to the instance, you attach an Elastic IP address, which links the DNS record to the web stack at that specific IP address.

When you consider how to enhance this architecture, an easy solution that you might consider is to upgrade to a larger EC2 instance. However, this approach poses the following challenges:

  • There's an absence of failover mechanisms.

  • There's a lack of redundancy.

  • There's a heavy reliance on this single instance.

This solution highlights the importance of the first principle in cloud architecture: Design for failure. As Amazon's CTO, Werner Vogels, emphasized, "Everything fails, all the time." This principle underscores the necessity of anticipating failures in your architecture. When you assume that every component will eventually fail, your application can remain resilient even when individual elements have issues. A practical goal in designing for failure is to make sure that your application continues to work, even if the physical hardware of one server fails.

Let's consider how to improve the example application and scale up to meet resilience best practices.

First, split the single host into separate components:

  • Separate the web component onto its own instance that hosts the web application.

  • Separate the database from the original instance so that it operates independently. To do this, you can use an Amazon Aurora DB instance.

[Diagram: The web application on its own EC2 instance, with the database moved to a separate Amazon Aurora DB instance]

Next, address the lack of failover and the redundancy issues:

  • Add an additional web instance in another Availability Zone (AZ).

  • To achieve multi-AZ deployment, create an Aurora Replica or Reader node in a different AZ.

  • To share the load between the two web instances, replace your Elastic IP address with an Elastic Load Balancer (a provisioning sketch appears below).

The architecture looks similar to the following diagram:

[Diagram: Web instances in two Availability Zones behind an Elastic Load Balancer, with an Aurora writer and reader in separate AZs]

After these changes, the application is more scalable and has built-in fault tolerance.
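
The following Python (boto3) sketch shows one way to provision the load-balancing part of these changes. It's a minimal illustration, not a production template: it assumes a VPC with two public subnets in different AZs and two running web instances, and every ID shown is a hypothetical placeholder.

# A minimal sketch: create an Application Load Balancer that spans two AZs
# and forwards traffic to two web instances. All IDs are placeholders.
import boto3

elbv2 = boto3.client("elbv2")

# Application Load Balancer across two Availability Zones.
lb = elbv2.create_load_balancer(
    Name="web-alb",
    Subnets=["subnet-aaa111", "subnet-bbb222"],   # one subnet per AZ
    Scheme="internet-facing",
    Type="application",
)
lb_arn = lb["LoadBalancers"][0]["LoadBalancerArn"]

# Target group with a health check so failed instances leave the rotation.
tg = elbv2.create_target_group(
    Name="web-targets",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-ccc333",
    HealthCheckPath="/health",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Register the two web instances (one per AZ) and forward traffic to them.
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "i-0webaz1example"}, {"Id": "i-0webaz2example"}],
)
elbv2.create_listener(
    LoadBalancerArn=lb_arn,
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)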

The following are best practices to follow when you design for failure:

  • Use multiple AZs: When you distribute your application across multiple AZs, you reduce the risk of complete system failure. If one AZ has issues, then your application can continue to run in the other AZs, so your services stay available and downtime is minimized.

  • Use Elastic Load Balancing: Elastic Load Balancing distributes incoming traffic across multiple targets, such as EC2 instances, in multiple AZs. This distributed traffic improves application availability, and can help you handle failed instances seamlessly. With the ability to redirect traffic from failed instances to healthy ones, you can enhance your overall system resilience.

  • Use Elastic IP addresses: Elastic IP addresses provide a static IP address that you can remap to another instance in case of failure. To recover faster and reduce downtime, redirect traffic to healthy resources quickly without the need to change DNS settings.

  • Monitor in real time with Amazon CloudWatch: With CloudWatch, you can monitor in real time to detect issues quickly and respond proactively. When you set up CloudWatch alarms and metrics, you can identify potential failures before they affect your users and alert your team for rapid intervention (an alarm sketch follows this list).

  • Use database multi-AZ deployments: Multi-AZ deployments for databases provide enhanced availability and durability. If an infrastructure failure occurs, then database operations automatically fail over to a standby replica in another AZ. This failover minimizes disruption to your application and supports data consistency.
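
As a small, hedged example of the CloudWatch guidance, the following Python (boto3) sketch creates an alarm on the load balancer's healthy host count. The SNS topic ARN and the load balancer and target group dimension values are hypothetical placeholders.

# A minimal sketch: alarm if no healthy targets remain behind the load
# balancer for three consecutive minutes, and notify an on-call SNS topic.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="web-tier-no-healthy-hosts",
    Namespace="AWS/ApplicationELB",
    MetricName="HealthyHostCount",
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/web-targets/0123456789abcdef"},
        {"Name": "LoadBalancer", "Value": "app/web-alb/0123456789abcdef"},
    ],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall-alerts"],
)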

Principle 2: Build security in every layer

When you embed security measures at each layer, from infrastructure to application, you create a robust defense against potential vulnerabilities. The following best practices enhance the resiliency of your applications and are integral to building robust, fault-tolerant systems that can withstand various threats and failures:

  • Implement multi-factor authentication (MFA): Use multiple forms of verification to access critical systems. MFA adds an extra layer of security.

  • Encrypt data at rest and in transit: Use encryption protocols to protect sensitive data both when it's stored and when it's transmitted between components.

  • Regularly update and patch systems: To stay vigilant against vulnerabilities, promptly apply software patches and updates to mitigate potential security risks.

  • Implement least privilege access: Limit access rights for users and applications to only what's necessary for their specific roles or functions. This reduces the attack surface (a minimal policy sketch follows this list).

  • Deploy network segmentation: To contain breaches and prevent lateral movement by attackers, divide your network into smaller, isolated segments.

  • Monitor and audit systems: Implement robust monitoring and auditing mechanisms to detect unusual activities or security breaches.

  • Backup and disaster recovery: Regularly back up critical data and have a robust disaster recovery plan in place to restore operations swiftly in case of an incident.

  • Implement Distributed Denial of Service (DDoS) protection: Use DDoS protection services to mitigate and withstand large-scale attacks that aim to disrupt service availability.

  • Use Web Application Firewalls (WAF) and Intrusion Detection/Prevention Systems (IDS/IPS): Deploy AWS WAF and IDS/IPS to monitor and filter incoming traffic for malicious activities.

  • Continuous security training and awareness: To foster a security-conscious culture, educate employees and stakeholders about security best practices and keep them informed about evolving threats.
    Note: For a complete guide on AWS security best practices, see Best Practices for Security, Identity, & Compliance.
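
As a minimal illustration of the least privilege practice above, the following Python (boto3) sketch creates an identity policy that only allows reading objects from one application bucket. The bucket name and policy name are hypothetical placeholders.

# A minimal least-privilege sketch: a customer managed policy that allows
# read-only access to a single application bucket. Names are placeholders.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-webapp-assets/*",
        }
    ],
}

# Create the policy so that it can be attached to the web tier's role.
iam.create_policy(
    PolicyName="webapp-assets-read-only",
    PolicyDocument=json.dumps(policy_document),
)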

When you integrate these security best practices into your resilience strategy, you can bolster the protection of your systems and applications against various threats. This encourages sustained operations, even in difficult situations.

Principle 3: Use many storage options

The principle of "Use many storage options" acknowledges that in cloud architecture, one size doesn't fit all. Because it's important to tailor storage solutions to specific requirements, this principle encourages a nuanced approach to data storage. Like how a wardrobe contains various clothing for different occasions, your storage strategy must include a variety of options.

Choose the storage option that best fits your needs:

  • Amazon Simple Storage Service (Amazon S3) and Amazon S3 Glacier: Amazon S3, a versatile object storage service, excels in scalable and low-latency data access. Amazon S3 Glacier, a cost-effective archival solution, provides secure, long-term storage for infrequently accessed data within AWS (a lifecycle sketch follows this list).

  • Amazon S3 Intelligent-Tiering: Amazon S3 Intelligent-Tiering automatically optimizes storage costs by moving objects between access tiers (frequent and infrequent access) based on changing usage patterns. This intelligent storage class helps with efficient performance and savings without manual intervention.

  • Amazon CloudFront: CloudFront is a content delivery network (CDN) service that accelerates the delivery of web content globally to users, including images, videos, and APIs. With low latency and high data transfer speeds, CloudFront enhances user experiences by distributing content from edge locations strategically positioned around the world.

  • Amazon DynamoDB: Amazon DynamoDB is a fully managed NoSQL database service that provides seamless and fast performance at any scale. With automatic scaling, low-latency access, and native support for document and key-value data models, DynamoDB simplifies database management for developers.

  • Amazon Elastic Block Store (Amazon EBS): Amazon EBS provides scalable and durable block storage volumes for use with EC2 instances. Amazon EBS offers low-latency performance and customizable volume options to facilitate reliable and efficient data storage solutions in the AWS Cloud.

  • Amazon Relational Database Service (Amazon RDS): Amazon RDS is a fully managed database service that simplifies database administration tasks. Amazon RDS offers support for multiple database engines, automated backups, and scalable compute resources. Amazon RDS promotes seamless and reliable relational database deployment in the AWS Cloud.

  • Amazon Redshift: Amazon Redshift is a fully managed data warehouse service that's designed for high-performance analysis and standard SQL queries. With automatic scaling, columnar storage, and integration with popular business intelligence tools, Amazon Redshift helps organizations analyze large datasets with speed and efficiency in the AWS Cloud.
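
As a hedged example of combining these options, the following Python (boto3) sketch adds a lifecycle rule that moves aging objects to S3 Intelligent-Tiering and later archives them to S3 Glacier. The bucket name, prefix, and transition days are hypothetical placeholders.

# A minimal lifecycle sketch: tier objects after 30 days, archive after a year.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-webapp-assets",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "reports/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
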
With the web application example in mind, we can shift some of the load by choosing the right storage option for each component:

  • Move any static assets from the web app instances to Amazon S3, and then use CloudFront to serve those objects. These assets include your images, videos, CSS, JavaScript, and other heavy static content.

  • Use an Amazon S3 origin to serve these files so that CloudFront can cache and distribute them globally (see the sketch after the following diagram). This takes the load off your web servers so that you can reduce your footprint in the web tier.

[Diagram: Static assets served from an Amazon S3 origin and distributed globally through CloudFront]
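
The following Python (boto3) sketch shows the offloading idea in miniature: it publishes one static file to the S3 bucket that CloudFront uses as its origin, with cache headers so that edge locations can keep serving it. The bucket, file path, and distribution setup are hypothetical, and creating the CloudFront distribution itself is omitted here.

# A minimal sketch: upload a stylesheet to the static-assets bucket that the
# CloudFront distribution uses as its origin. Names and paths are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    "build/css/site.css",
    "example-webapp-assets",
    "css/site.css",
    ExtraArgs={
        "ContentType": "text/css",
        "CacheControl": "public, max-age=86400",  # let edge caches keep it for a day
    },
)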

  • You can also move session information to a NoSQL database like DynamoDB or to a cache like Amazon ElastiCache. For this article, DynamoDB is used because of the availability of connectors in many of the AWS SDKs (a session sketch follows the next diagram).

  • Use ElastiCache to store common database query results. This reduces repeated hits on the database and takes load off your database tier.

  • For horizontal scaling, remove session state from your web or app tier. This is called making your tier "stateless". While making a tier stateless simplifies scaling and reduces dependencies on specific servers, it can introduce additional complexity, such as managing external session storage. This complexity requires thoughtful design so that it doesn't impact overall system resilience. In essence, carefully balance the trade-off between scalability and complexity to maintain resilience.

[Diagram: Session state stored in DynamoDB and cached query results in ElastiCache, supporting a stateless web tier]
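
To make the stateless web tier concrete, the following Python (boto3) sketch externalizes session data to DynamoDB. The table name, key schema, and TTL attribute are hypothetical, and the table (with session_id as the partition key and TTL enabled on expires_at) is assumed to already exist.

# A minimal sketch of externalizing session state so any web instance can
# serve any request. Table name, keys, and values are placeholders.
import time
import boto3

sessions = boto3.resource("dynamodb").Table("webapp-sessions")

# Write the session on login.
sessions.put_item(
    Item={
        "session_id": "b1946ac92492d2347c6235b4d2611184",
        "user_id": "user-42",
        "cart_items": ["sku-123", "sku-456"],
        "expires_at": int(time.time()) + 3600,  # DynamoDB TTL removes it after an hour
    }
)

# Read it back from any instance behind the load balancer.
item = sessions.get_item(Key={"session_id": "b1946ac92492d2347c6235b4d2611184"}).get("Item")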

Part 2 will explore the remaining four design principles that form the backbone of resilient applications on AWS. Stay tuned for insights into best practices that guide you toward designing applications that withstand disruptions and remain agile and reliable in today's digital landscape.

Conclusion

This article introduces resilience and the importance of setting resilience goals. It's important to understand that a well-prepared customer is more likely to withstand application failures and outages. AWS Support engineers and TAMs can help you with general guidance, best practices, troubleshooting, Enterprise Support-specific entitlements, and operational support on AWS. To learn more about our plans and offerings, see AWS Support.

About the author


Rav Bommakanti is a Senior TAM with AWS Energy. He's passionate about solving complex customer problems. With more than 16 years of experience in IT across various domains and technologies, he brings vast expertise in developing resilient, cost-effective, and innovative solutions. In his free time, he enjoys traveling and photography.