Extending Amazon Aurora Auto Scaling: Automated Solution for Read Replica Insufficient Capacity

8 minute read
Content level: Expert

This article presents an automated solution for Amazon Aurora insufficient capacity events (RDS-EVENT-0031). Using an event-driven Lambda architecture, it automatically provisions read replicas across alternative instance types and Availability Zones. The solution extends Amazon Aurora's native auto-scaling with AWS Lambda and EventBridge: scale-up is triggered by capacity events, and scale-down runs automatically based on CPU utilization.

Overview

Amazon Aurora's built-in auto-scaling provides robust capabilities for managing reader instances. However, when provisioning read replicas encounters insufficient capacity in specific instance types or availability zones (RDS-EVENT-0031), this open-source solution seamlessly extends Aurora's capabilities. Using AWS Lambda and EventBridge, it automatically explores alternative instance types and availability zones, ensuring continuous availability while optimizing costs through intelligent scale-down based on CPU utilization.

Source Code: This solution is available as open-source on GitHub at https://github.com/aws-samples/sample-ManagedAutoScaler

When to Use This Solution

This solution is for organizations that need:

  • Workloads experiencing RDS-EVENT-0031 insufficient capacity events that require automated resolution
  • Additional flexibility when default instance types experience temporary capacity constraints
  • Automated adaptation across multiple instance types and availability zones
  • Scaling without manual intervention during varying demand patterns, enhancing workload resiliency for applications requiring strict performance standards

The solution works alongside Aurora's native auto-scaling, providing an additional layer of operational flexibility.

Managing Amazon Aurora Insufficient Capacity: Automated Read Replica Provisioning

When Aurora's native auto-scaling attempts to add reader capacity but encounters temporary constraints for your default instance type in a specific availability zone, it generates an insufficient capacity event (RDS-EVENT-0031). Without automation, this requires manual intervention to identify alternative instance types or availability zones with available capacity, provision appropriate reader instances, and later remove them when demand decreases - a process that can take several minutes during critical periods. Some organizations address these challenges by over-provisioning Aurora readers or using Aurora Serverless. While these approaches work for many scenarios, workloads with unpredictable scaling patterns and strict performance requirements often need a tailored solution that can cost-effectively adapt to available capacity while maintaining performance standards.

Event-Driven Aurora Capacity Management Solution

This automated scaling solution extends Aurora's capabilities using AWS Lambda, Amazon EventBridge, CloudWatch, and SNS - fully deployed via Terraform. The solution features an event-driven architecture with two primary components:

  • Scale-Up Component: A Lambda function triggered by RDS insufficient capacity events (RDS-EVENT-0031) that automatically provisions reader instances using alternative instance types and availability zones based on a predefined preference order.
  • Scale-Down Component: A Lambda function executed on a customizable schedule that monitors CloudWatch metrics for CPU utilization and removes underutilized reader instances created by this solution when demand decreases.
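
As a rough illustration of the scale-up trigger, the sketch below shows how a Lambda handler might recognize an RDS-EVENT-0031 event delivered by EventBridge. It is not taken from the repository. The field names (`source`, `detail.EventID`, `detail.SourceIdentifier`) follow the general shape of RDS events on EventBridge and should be treated as assumptions to verify against a real event; `extract_failed_instance` and the returned dictionaries are hypothetical.

```python
from typing import Any, Optional

# Event ID the article associates with insufficient capacity.
INSUFFICIENT_CAPACITY_EVENT_ID = "RDS-EVENT-0031"


def extract_failed_instance(event: dict[str, Any]) -> Optional[str]:
    """Return the DB instance identifier for an RDS-EVENT-0031 event, else None."""
    if event.get("source") != "aws.rds":
        return None
    detail = event.get("detail") or {}
    if detail.get("EventID") != INSUFFICIENT_CAPACITY_EVENT_ID:
        return None
    # Assumed field: the identifier of the instance that could not be provisioned.
    return detail.get("SourceIdentifier")


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Hypothetical entry point: ignore unrelated events, flag capacity events."""
    instance_id = extract_failed_instance(event)
    if instance_id is None:
        return {"action": "ignored"}
    # A real function would start the fallback provisioning at this point.
    return {"action": "scale_up", "failed_instance": instance_id}
```

Filtering again inside the handler is a defensive choice: the EventBridge rule already narrows the events, but the check keeps the function safe if the rule is later broadened.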

Architecture

The solution uses an event-driven architecture:

  • EventBridge Rule: Captures RDS insufficient capacity events (RDS-EVENT-0031) and triggers the scale-up Lambda function
  • Scale-Up Lambda: Analyzes current reader distribution, checks EC2 capacity availability, and provisions new readers using fallback strategies
  • EventBridge Scheduler: Invokes the scale-down Lambda function on a configurable schedule
  • Scale-Down Lambda: Evaluates CPU utilization and removes underutilized reader instances created by this solution while maintaining minimum capacity and AZ distribution
  • CloudWatch Integration: Provides detailed metrics and timing information

Figure: Managed Aurora AutoScaler architecture
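
The scale-down decision described above can be sketched as a pure function. This is a minimal illustration, not the repository's logic: the `Reader` record, the `managed` flag (standing in for however the solution marks the readers it created, for example a tag), and the rule "never remove the last reader in an Availability Zone" are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Reader:
    instance_id: str
    availability_zone: str
    avg_cpu: float   # average CPUUtilization over the lookback window
    managed: bool    # True if this reader was created by the solution (assumed marker)


def select_readers_to_remove(readers, cpu_threshold=10.0, min_readers=1):
    """Pick managed, underutilized readers without breaking the minimum count or emptying an AZ."""
    remaining = list(readers)
    az_counts = Counter(r.availability_zone for r in remaining)
    # Consider the least-loaded managed readers first.
    candidates = sorted(
        (r for r in readers if r.managed and r.avg_cpu < cpu_threshold),
        key=lambda r: r.avg_cpu,
    )
    selected = []
    for reader in candidates:
        if len(remaining) - 1 < min_readers:
            break  # removing one more would violate the minimum capacity
        if az_counts[reader.availability_zone] <= 1:
            continue  # keep at least one reader in every AZ that has one
        remaining.remove(reader)
        az_counts[reader.availability_zone] -= 1
        selected.append(reader.instance_id)
    return selected
```

Keeping the decision separate from the AWS calls makes it easy to unit-test thresholds and distribution rules before wiring in CloudWatch and RDS.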

Scaling Strategies

The solution supports two distinct approaches to meet different operational priorities:

Feature | Instance Priority Strategy | Availability Zone Priority Strategy
Primary Goal | Maintain consistent instance types | Balance Availability Zone distribution
Best For | Workloads sensitive to instance type changes | Applications requiring Multi-AZ resiliency
Instance Selection | Exhausts all AZs for each instance type before trying the next type | Tries all instance types in each AZ before moving to the next AZ
Performance Consistency | Higher (similar instance types) | Variable (may mix instance types)
Resiliency Focus | Lower (might concentrate in fewer AZs) | Higher (prioritizes AZ distribution)
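
The difference between the two strategies comes down to loop order. The sketch below is an illustration under assumptions, not the repository's code: `fallback_candidates` and `provision_first_available` are hypothetical names, and `try_create` stands in for whatever call attempts to create a reader and reports success.

```python
def fallback_candidates(instance_types, availability_zones, strategy):
    """Return (instance_type, az) pairs in the order the chosen strategy would try them."""
    if strategy == "instance-priority":
        # Exhaust every AZ for one instance type before moving to the next type.
        return [(t, az) for t in instance_types for az in availability_zones]
    if strategy == "az-priority":
        # Try every instance type in one AZ before moving to the next AZ.
        return [(t, az) for az in availability_zones for t in instance_types]
    raise ValueError(f"unknown fallback strategy: {strategy!r}")


def provision_first_available(candidates, try_create):
    """Walk the candidates in order; return the first pair for which try_create succeeds."""
    for instance_type, az in candidates:
        if try_create(instance_type, az):
            return (instance_type, az)
    return None  # every combination was out of capacity
```

With types `["r6g.large", "r5.large"]` and two AZs, instance-priority tries `r6g.large` in both AZs before touching `r5.large`, while az-priority tries both types in the first AZ before moving to the second.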

Enhanced Availability and Cost Optimization Benefits

  • Enhanced Availability: Automatically finds available capacity across multiple AZs and instance types when your default option experiences temporary constraints.
  • Cost Efficiency: Scales down underutilized readers automatically based on configurable CPU thresholds, avoiding over-provisioning costs.
  • Zero Manual Intervention: Fully automated response to scaling events, with scale-up times of a few minutes.
  • Flexible Strategies: Choose between instance-priority (maintains consistent instance types) or AZ-priority (balances geographic distribution) scaling approaches.

Solution Value

This solution enhances Aurora's auto-scaling capabilities to deliver greater operational flexibility and resilience:

Capability | Native Aurora Auto Scaling | Enhanced with This Solution
Instance type flexibility | Single preferred type | ✅ Flexible instance types
Multi-AZ adaptation | Single AZ attempt | ✅ Provisions across multiple AZs
Scale-down automation | Manual monitoring and intervention | ✅ Automatic optimization based on CPU metrics
Capacity planning | Static provisioning approach | ✅ Dynamic adaptation to real-time capacity availability
Cost management | Periodic manual reviews | ✅ Continuous automated optimization

Prerequisites

Development Requirements:

  • AWS CLI 2.15.0+
  • Python 3.13+
  • Terraform 1.0.0+

AWS Requirements:

  • Operational Aurora PostgreSQL (11.x+) or MySQL (8.x+) cluster with Aurora auto-scaling enabled and configured
  • CloudWatch metrics enabled for your Aurora cluster
  • VPC with at least two private subnets across Availability Zones
  • Appropriate IAM permissions for Aurora operations, Amazon EC2 capacity checking, CloudWatch metrics, and EventBridge scheduling

Cost Considerations:

  • Lambda and EventBridge: $20-50/month per Aurora cluster (varies based on scaling frequency)
  • Additional Aurora reader instances: Variable based on usage patterns

Deployment

  1. Clone the Repository
git clone https://github.com/aws-samples/sample-ManagedAutoScaler
cd sample-ManagedAutoScaler
  2. Configure Your Environment

Create a terraform.tfvars file with your configuration:

region = "us-east-1"
db_cluster_id = "your-aurora-cluster"
db_engine = "aurora-postgresql"
aurora_reader_tier = 15
cpu_threshold = 10.0
cpu_lookback_minutes = 5
notification_email = "your-email@example.com"
preferred_instance_type = "r6i.large"
instance_types_priority = ["r6g.large", "r5.large", "r6i.xlarge"]
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
fallback_strategy = "instance-priority"
enable_sns = true
enable_security_hardening = true

Key configuration parameters include:

  • db_cluster_id: Your Aurora cluster identifier
  • db_engine: aurora-postgresql or aurora-mysql
  • instance_types_priority: Ordered list of fallback instance types
  • availability_zones: Preferred AZs for deployment
  • fallback_strategy: instance-priority or az-priority
  • cpu_threshold: CPU percentage threshold for scale-down (default: 10%)
  • cpu_lookback_minutes: Duration to monitor CPU before scaling down (default: 5 minutes)
  • aurora_reader_tier: Priority tier for reader instances (0-15, higher = lower priority)
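
Before running Terraform, it can help to sanity-check these values. The sketch below is a hypothetical pre-flight check, not part of the repository: `validate_config` is an invented helper, and the allowed values and ranges are taken only from the parameter descriptions above.

```python
from typing import Any

VALID_ENGINES = {"aurora-postgresql", "aurora-mysql"}
VALID_STRATEGIES = {"instance-priority", "az-priority"}


def validate_config(cfg: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    if not cfg.get("db_cluster_id"):
        problems.append("db_cluster_id must be set")
    if cfg.get("db_engine") not in VALID_ENGINES:
        problems.append("db_engine must be aurora-postgresql or aurora-mysql")
    if cfg.get("fallback_strategy") not in VALID_STRATEGIES:
        problems.append("fallback_strategy must be instance-priority or az-priority")
    if not cfg.get("instance_types_priority"):
        problems.append("instance_types_priority must list at least one instance type")
    if not cfg.get("availability_zones"):
        problems.append("availability_zones must list at least one AZ")
    tier = cfg.get("aurora_reader_tier")
    if not (isinstance(tier, int) and 0 <= tier <= 15):
        problems.append("aurora_reader_tier must be an integer from 0 to 15")
    cpu = cfg.get("cpu_threshold", 10.0)  # documented default: 10%
    if not (isinstance(cpu, (int, float)) and 0 < cpu <= 100):
        problems.append("cpu_threshold must be a percentage between 0 and 100")
    lookback = cfg.get("cpu_lookback_minutes", 5)  # documented default: 5 minutes
    if not (isinstance(lookback, int) and lookback >= 1):
        problems.append("cpu_lookback_minutes must be a positive integer")
    return problems
```

Passing the values from the terraform.tfvars example above as a dictionary returns an empty list; changing `fallback_strategy` to an unsupported value returns a single message naming that parameter.
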
  3. Deploy with Terraform
cd terraform
terraform init
terraform plan
terraform apply
  4. Verify Deployment

Check Lambda functions:

aws lambda get-function-configuration \
  --function-name aurora-autoscale-up \
  --query '{Name:FunctionName,Runtime:Runtime,Environment:Environment.Variables}'

Verify EventBridge rule:

aws events list-rules \
  --name-prefix "rds-insufficient" \
  --query 'Rules[].{Name:Name,State:State,EventPattern:EventPattern}'

Check EventBridge Scheduler:

aws scheduler list-schedules \
  --query 'Schedules[?starts_with(Name, `aurora-`)].{Name:Name,State:State,ScheduleExpression:ScheduleExpression}'

Cleanup

To remove all resources:

cd terraform
terraform destroy

Note: Reader instances will remain but won't be automatically managed after cleanup.

Production Considerations

⚠️ Critical Guidelines:

  • Writer Failover Planning: Writer failover to larger reader instances will increase costs, and reverting to smaller instances might affect performance. Always ensure reader instances are sized equal to or larger than the writer to maintain consistency during failovers.
  • Availability Zone Distribution: Distribute your DB instances across multiple Availability Zones to mitigate downtime risk. Maintain at least one reader per Availability Zone and regularly verify that auto scaling maintains proper distribution as instances are added or removed.
  • Aurora Priority Tier Configuration: Set instances of the same type and size as the writer to the highest priority tier (0-1) for optimal failover targets. Structure your tier configuration to make sure the most capable instances are prioritized for writer failover scenarios.
  • Instance Type Validation: Define minimum instance size requirements based on your workload needs. Create pools of compatible, thoroughly tested instance types, and validate performance across different configurations before adding them to your production environment.
  • Monitoring Recommendations: Implement monitoring for Availability Zone distribution, cost alerts for unexpected scaling events, failover impacts, resource utilization across instance types, and single-AZ concentration risks.
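
Two of these guidelines lend themselves to simple checks. The sketch below is illustrative only: `promotion_tier` and `azs_without_reader` are hypothetical helpers, the choice of tier 0 for readers matching the writer and tier 15 for everything else is an assumption within the 0-15 range, and the instance class names are examples.

```python
def promotion_tier(reader_class, writer_class, matching_tier=0, fallback_tier=15):
    """Give readers that match the writer's instance class the preferred failover tier."""
    return matching_tier if reader_class == writer_class else fallback_tier


def azs_without_reader(reader_azs, required_azs):
    """Return the required AZs that currently have no reader, in the order given."""
    present = set(reader_azs)
    return [az for az in required_azs if az not in present]
```

A monitoring job could call `azs_without_reader` after each scaling action and raise an alert whenever the result is non-empty, which covers the single-AZ concentration risk mentioned above.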

Conclusion

This solution empowers your Aurora clusters with automated, adaptive scaling that extends Aurora's native capabilities during capacity constraints while optimizing costs and maintaining high availability. By working alongside Aurora's built-in functionality, it provides an additional safety net for specific workloads. The automated scale-down feature helps optimize costs by removing unnecessary capacity during low-demand periods, while the adaptive scale-up ensures your database can handle unexpected load increases with multiple fallback options.

References

GitHub Repository

Amazon Aurora Documentation

Amazon RDS Events

Aurora Priority Tier