Handling EC2 Capacity Constraints with Automated Instance Type Flexibility

6 minute read
Content level: Intermediate

An automated solution for EC2 workloads on periodic start/stop schedules that encounter InsufficientInstanceCapacity errors. It switches to an alternative instance type on start and reverts to the original type on stop.

When running EC2 workloads, you might encounter InsufficientInstanceCapacity errors that prevent your instances from starting. This is particularly challenging for workloads that start and stop on a regular basis, such as development environments, batch processing jobs, or scheduled compute tasks.

While On-Demand Capacity Reservations (ODCRs) are ideal for guaranteeing capacity, they require advance planning and commitment. The automated solution described here handles capacity constraints in real time by switching to alternative instance types.

The Strategy

This solution automatically responds to InsufficientInstanceCapacity by:

  1. Detecting StartInstances failures via a CloudWatch Events (now Amazon EventBridge) rule that monitors CloudTrail
  2. Attempting individual starts for each failed instance
  3. Finding compatible alternatives using EC2's instance requirements API
  4. Modifying instance types to available alternatives sorted by price
  5. Reverting on stop to restore original instance types
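The detection step can be sketched as an event pattern plus a small parser. This is a minimal illustration, not the project's actual code: it assumes the rule matches CloudTrail-delivered events on the errorCode field, and that the failed request lists its targets under requestParameters.instancesSet.items, which is the usual CloudTrail layout for EC2 calls. The names EVENT_PATTERN and extract_instance_ids are hypothetical.

```python
# Event pattern for StartInstances calls that CloudTrail recorded as failing
# with insufficient capacity. Assumed shape, not taken from the project.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["StartInstances"],
        "errorCode": ["Server.InsufficientInstanceCapacity"],
    },
}


def extract_instance_ids(event: dict) -> list[str]:
    """Return the instance IDs named in a failed StartInstances call.

    Assumes the CloudTrail layout
    detail.requestParameters.instancesSet.items[].instanceId.
    """
    params = event.get("detail", {}).get("requestParameters") or {}
    items = params.get("instancesSet", {}).get("items", [])
    return [item["instanceId"] for item in items if "instanceId" in item]
```

A Lambda handler would call extract_instance_ids(event) first and then process each ID separately, since a single StartInstances call can name several instances.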

Architecture Components

[Figure: Architecture diagram]

  • CloudWatch Events Rules: Monitor CloudTrail and EC2 state changes
  • Lambda Functions: Handle recovery and revert logic
  • DynamoDB Table: Records handled events so the same failure is not processed twice
  • SSM Parameter Store: Dynamic configuration per instance or global defaults
  • IAM Roles: Scoped permissions requiring Flexible=true tag
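One common way to get this kind of deduplication from DynamoDB is a conditional write: the first invocation to record an event ID wins, and later ones see the condition fail. The sketch below assumes a table keyed on a string attribute named eventId with an expiresAt TTL attribute; both names, and the helper functions themselves, are hypothetical and may differ from the deployed table.

```python
import time


def dedup_put_request(table_name: str, event_id: str, ttl_seconds: int = 3600) -> dict:
    """Build PutItem arguments that succeed only for the first writer."""
    return {
        "TableName": table_name,
        "Item": {
            "eventId": {"S": event_id},  # hypothetical partition key
            "expiresAt": {"N": str(int(time.time()) + ttl_seconds)},  # hypothetical TTL attribute
        },
        # The write is rejected if an item with this key already exists.
        "ConditionExpression": "attribute_not_exists(eventId)",
    }


def claim_event(dynamodb_client, table_name: str, event_id: str) -> bool:
    """Return True if this invocation is the first to record the event.

    dynamodb_client is expected to behave like boto3.client("dynamodb").
    """
    try:
        dynamodb_client.put_item(**dedup_put_request(table_name, event_id))
        return True
    except dynamodb_client.exceptions.ConditionalCheckFailedException:
        return False
```

A handler would return early whenever claim_event(...) is False, so a retried or duplicated event does not trigger a second recovery attempt.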

How It Works

Start Workflow

[Figure: Start workflow sequence diagram]

  1. CloudWatch Events Rule monitors CloudTrail for StartInstances API calls that fail with Server.InsufficientInstanceCapacity
  2. Lambda function is triggered with the failed instance IDs
  3. For each instance tagged with Flexible=true:
    1. Attempts to start with current instance type
    2. If that fails:
      1. Queries compatible instance types based on configurable criteria
      2. Queries prices for the eligible instance types
      3. Modifies to the cheapest compatible alternative
      4. Tags the instance with OriginalType for later restoration
      5. Retries the start operation
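The per-instance steps above can be sketched as a small fallback loop. To keep the example runnable without AWS access, the EC2 operations are passed in as callables; CapacityError and start_with_fallback are hypothetical names, and the behavior when every candidate fails (here, simply returning None) is an assumption, since the article does not describe that case.

```python
from typing import Callable, Iterable, Optional


class CapacityError(Exception):
    """Stands in for an InsufficientInstanceCapacity failure."""


def start_with_fallback(
    instance_id: str,
    current_type: str,
    priced_candidates: Iterable[tuple[str, float]],
    start: Callable[[str], None],
    modify_type: Callable[[str, str], None],
    tag_original: Callable[[str, str], None],
) -> Optional[str]:
    """Try the current type first, then alternatives from cheapest to priciest.

    Returns the instance type that started, or None if every attempt failed.
    """
    try:
        start(instance_id)
        return current_type
    except CapacityError:
        pass  # fall through to the alternatives

    tagged = False
    for candidate, _price in sorted(priced_candidates, key=lambda c: c[1]):
        if candidate == current_type:
            continue  # already tried
        modify_type(instance_id, candidate)
        if not tagged:
            # Record the original type once so the stop workflow can restore it.
            tag_original(instance_id, current_type)
            tagged = True
        try:
            start(instance_id)
            return candidate
        except CapacityError:
            continue
    return None
```

In a real handler, start, modify_type, and tag_original would wrap the corresponding EC2 calls and translate the capacity error into CapacityError.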

Stop Workflow

[Figure: Stop workflow sequence diagram]

  1. CloudWatch Events Rule monitors EC2 instance state changes to stopped
  2. Lambda function checks for instances with OriginalType tag
  3. Waits for instance to fully stop
  4. Reverts to original instance type
  5. Removes the OriginalType tag
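A minimal sketch of the revert step follows. It assumes an ec2 argument that behaves like boto3.client("ec2") and uses only calls that client provides (describe_instances, the instance_stopped waiter, modify_instance_attribute, delete_tags). The function name and return convention are hypothetical, and error handling is omitted.

```python
def revert_instance_type(ec2, instance_id: str, tag_key: str = "OriginalType") -> bool:
    """Restore the type stored in the OriginalType tag, if the tag is present.

    Returns True when a revert was performed, False when there was nothing to do.
    """
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    original_type = tags.get(tag_key)
    if original_type is None:
        return False  # instance was never switched

    # The instance type can only be changed while the instance is stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    if instance["InstanceType"] != original_type:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            InstanceType={"Value": original_type},
        )
    # Deleting by key removes the tag regardless of its value.
    ec2.delete_tags(Resources=[instance_id], Tags=[{"Key": tag_key}])
    return True
```

Removing the tag last means a failed modification leaves the OriginalType marker in place, so a later stop event can retry the revert.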

Configuration Flexibility

The solution supports three levels of configuration:

  1. Instance-specific: Tag instances with FlexibleConfigurationArn pointing to a custom SSM parameter
  2. Global default: Use /flexible-instance-starter/default SSM parameter
  3. Fallback: Embedded configuration in Lambda function
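The lookup order can be illustrated with a small resolver. This sketch assumes that the first source that yields a value wins and that no merging between levels takes place; the article does not say which behavior the Lambda function uses. The fetch_parameter callable, resolve_config, and the contents of EMBEDDED_CONFIG are hypothetical (the values echo examples from this article).

```python
import json
from typing import Callable, Optional

DEFAULT_PARAMETER = "/flexible-instance-starter/default"

# Hypothetical embedded fallback; values echo examples from this article.
EMBEDDED_CONFIG = {"memoryBufferPercentage": 5, "maxCpuMultiplier": 4}


def resolve_config(
    instance_tags: dict,
    fetch_parameter: Callable[[str], Optional[str]],
) -> dict:
    """Return the configuration from the most specific source available.

    fetch_parameter takes a parameter name or ARN and returns its string
    value, or None when the parameter does not exist.
    """
    sources = [instance_tags.get("FlexibleConfigurationArn"), DEFAULT_PARAMETER]
    for name in sources:
        if not name:
            continue  # tag not set on this instance
        raw = fetch_parameter(name)
        if raw is not None:
            return json.loads(raw)
    return dict(EMBEDDED_CONFIG)  # copy so callers cannot mutate the default
```

In a deployment, fetch_parameter would wrap an SSM GetParameter call and return None when the parameter is not found.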

Key Configuration Parameters

Memory and Storage Buffers

{
  "memoryBufferPercentage": 5,
  "localStorageBufferPercentage": 5
}

Allows selecting instances with slightly less memory or local storage than the original. For example, a 5% buffer means an instance with 7.6 GB of memory can stand in for an 8 GB original.

CPU and Memory Multipliers

{
  "maxCpuMultiplier": 4,
  "maxMemoryMultiplier": 2
}

Controls how much larger alternative instances can be (e.g., 4x CPU means a 4 vCPU instance can scale up to 16 vCPU).
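The arithmetic behind the buffer and multiplier settings can be sketched as follows. It assumes the lower vCPU bound is the original vCPU count and that a missing setting means "no buffer" or "no growth"; those defaults, and the function name size_bounds, are illustrative assumptions rather than the solution's documented behavior.

```python
import math


def size_bounds(vcpus: int, memory_mib: int, config: dict) -> dict:
    """Translate buffer and multiplier settings into min/max bounds."""
    buffer_pct = config.get("memoryBufferPercentage", 0)
    cpu_multiplier = config.get("maxCpuMultiplier", 1)
    memory_multiplier = config.get("maxMemoryMultiplier", 1)
    return {
        "VCpuCount": {
            "Min": vcpus,  # assumption: never go below the original vCPU count
            "Max": vcpus * cpu_multiplier,
        },
        "MemoryMiB": {
            # A 5% buffer lowers the floor to 95% of the original memory.
            "Min": math.floor(memory_mib * (100 - buffer_pct) / 100),
            "Max": memory_mib * memory_multiplier,
        },
    }


bounds = size_bounds(
    vcpus=4,
    memory_mib=16384,
    config={"memoryBufferPercentage": 5, "maxCpuMultiplier": 4, "maxMemoryMultiplier": 2},
)
# bounds["VCpuCount"] is {"Min": 4, "Max": 16}
# bounds["MemoryMiB"] is {"Min": 15564, "Max": 32768}
```

For a 4 vCPU, 16 GiB original, these settings admit alternatives with 4 to 16 vCPUs and roughly 15.2 to 32 GiB of memory.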

CPU Manufacturers

{
  "cpuManufacturers": ["intel", "amazon-web-services"]
}

Restricts alternatives to specific CPU vendors for compatibility requirements.

Instance Type Exclusions

{
  "excludedInstanceTypes": ["p*.*", "g*.*", "inf*.*", "trn*.*", "f*.*"]
}

Uses wildcard patterns to exclude accelerated and specialized families: GPU (p, g), inference (inf), training (trn), and FPGA (f).
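To see which instance types such patterns catch, the matching can be approximated locally with Python's fnmatch module. This is only an approximation for illustration: fnmatch also gives special meaning to ? and [], whereas the patterns in this article use only the asterisk. The helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase

EXCLUDED_PATTERNS = ["p*.*", "g*.*", "inf*.*", "trn*.*", "f*.*"]


def is_excluded(instance_type: str, patterns=EXCLUDED_PATTERNS) -> bool:
    """Return True if the instance type matches any exclusion pattern."""
    return any(fnmatchcase(instance_type, pattern) for pattern in patterns)


for candidate in ["m5.large", "g5.xlarge", "p4d.24xlarge", "inf2.xlarge", "c7g.large"]:
    print(candidate, "excluded" if is_excluded(candidate) else "allowed")
```

Because each pattern is anchored at the start of the name, c7g.large is still allowed even though it contains the letter g.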

Bare Metal Control

{
  "bareMetal": "included"
}

Options: included, required, or excluded for bare metal instances.
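The configuration keys shown in this section line up closely with fields of the EC2 GetInstanceTypesFromInstanceRequirements request, so one plausible way to use them is to build that request directly. The sketch below assumes this mapping and assumes HVM virtualization; the article does not confirm either, and build_requirements_request is a hypothetical helper.

```python
def build_requirements_request(
    architecture: str,
    vcpu_range: dict,
    memory_mib_range: dict,
    config: dict,
) -> dict:
    """Build keyword arguments for get_instance_types_from_instance_requirements.

    vcpu_range and memory_mib_range are {"Min": ..., "Max": ...} dictionaries.
    """
    requirements = {
        "VCpuCount": vcpu_range,
        "MemoryMiB": memory_mib_range,
        # "excluded" mirrors the API's own default when the key is absent.
        "BareMetal": config.get("bareMetal", "excluded"),
    }
    if config.get("cpuManufacturers"):
        requirements["CpuManufacturers"] = list(config["cpuManufacturers"])
    if config.get("excludedInstanceTypes"):
        requirements["ExcludedInstanceTypes"] = list(config["excludedInstanceTypes"])
    return {
        "ArchitectureTypes": [architecture],  # e.g. "x86_64" or "arm64"
        "VirtualizationTypes": ["hvm"],  # assumption: HVM instances only
        "InstanceRequirements": requirements,
    }


request = build_requirements_request(
    architecture="x86_64",
    vcpu_range={"Min": 4, "Max": 16},
    memory_mib_range={"Min": 15564, "Max": 32768},
    config={
        "cpuManufacturers": ["intel", "amazon-web-services"],
        "excludedInstanceTypes": ["p*.*", "g*.*"],
        "bareMetal": "included",
    },
)
```

With boto3, the result could be passed as ec2.get_instance_types_from_instance_requirements(**request), typically through a paginator, to list the matching instance types.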

Implementation

Prerequisites

  • AWS CDK CLI installed
  • Python 3.9 or later
  • AWS credentials configured
  • CloudTrail enabled in your region

Deployment Steps

  1. Clone and set up

git clone https://github.com/aws-samples/sample-flexible-instance-starter
cd sample-flexible-instance-starter
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

  2. Deploy the stack

cdk deploy

  3. Tag your instances

aws ec2 create-tags \
    --resources i-1234567890abcdef0 \
    --tags Key=Flexible,Value=true

  4. Optional: Create custom configuration

aws ssm put-parameter \
    --name /flexible-instance-starter/my-workload \
    --type String \
    --value '{
        "memoryBufferPercentage": 10,
        "maxCpuMultiplier": 2,
        "cpuManufacturers": ["intel"],
        "excludedInstanceTypes": ["t*.*"]
    }'

# Tag instance to use custom config
aws ec2 create-tags \
    --resources i-1234567890abcdef0 \
    --tags Key=FlexibleConfigurationArn,Value=arn:aws:ssm:us-east-1:123456789012:parameter/flexible-instance-starter/my-workload

Key Benefits

  • Automatic recovery - No manual intervention required when capacity issues occur
  • Cost optimization - Alternatives are sorted by on-demand price, selecting the cheapest option first
  • Workload-specific configuration - Different flexibility rules per instance or workload type
  • Transparent operation - Original instance types are automatically restored on stop
  • Audit trail - All actions logged to CloudWatch Logs with detailed reasoning

Important Considerations

Compatibility

  • Only processes instances tagged with Flexible=true
  • Excludes GPU and specialized instance types by default
  • Respects architecture (x86_64 vs ARM64) and generation constraints
  • Attempts to maintain burstable performance characteristics when applicable

Limitations

  • Does not guarantee capacity availability for alternatives
  • Not suitable for workloads requiring specific hardware features

Costs

  • Lambda execution costs (typically minimal)
  • DynamoDB on-demand pricing for deduplication table
  • Potential increased EC2 costs if larger instance types are used
  • No additional cost for SSM Parameter Store (standard tier)

Security

  • IAM policies are scoped to instances with Flexible=true tag
  • Separate permissions for tag creation/deletion (only OriginalType)
  • CloudWatch Logs retention for audit trail
  • No cross-account or cross-region operations
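The tag-based scoping described above could look roughly like the policy below, written as a Python dictionary so it can be serialized with json.dumps. This is an illustrative sketch only: the statement IDs, the wildcard resource ARN, and the exact action list are assumptions, and the policy generated by the CDK stack may differ.

```python
import json

# Illustrative policy; not copied from the deployed stack.
FLEXIBLE_INSTANCE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StartAndModifyFlexibleInstances",
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:ModifyInstanceAttribute"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # Only instances that carry Flexible=true are in scope.
            "Condition": {"StringEquals": {"aws:ResourceTag/Flexible": "true"}},
        },
        {
            "Sid": "ManageOriginalTypeTagOnly",
            "Effect": "Allow",
            "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {"aws:ResourceTag/Flexible": "true"},
                # The only tag key the function may create or delete.
                "ForAllValues:StringEquals": {"aws:TagKeys": ["OriginalType"]},
            },
        },
    ],
}

policy_json = json.dumps(FLEXIBLE_INSTANCE_POLICY, indent=2)
```

Keeping the tag permissions in a separate statement limits the function to the OriginalType key, so it cannot add or remove the Flexible tag that gates its own access.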

Monitoring

Monitor the solution through CloudWatch Logs:

# View recovery attempts
aws logs tail /aws/lambda/InstanceRecoveryHandler --follow

# View revert operations
aws logs tail /aws/lambda/InstanceStopHandler --follow

# Check for instances with modified types
aws ec2 describe-instances \
    --filters "Name=tag-key,Values=OriginalType" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,Tags[?Key==`OriginalType`].Value|[0]]' \
    --output table

Advanced Use Cases

Per-Workload Configuration

Create different flexibility profiles for different workload types:

# Strict configuration for production databases
aws ssm put-parameter \
    --name /flexible-instance-starter/production-db \
    --type String \
    --value '{
        "memoryBufferPercentage": 0,
        "maxCpuMultiplier": 1,
        "cpuManufacturers": ["intel"]
    }'

# Flexible configuration for batch processing
aws ssm put-parameter \
    --name /flexible-instance-starter/batch-workers \
    --type String \
    --value '{
        "memoryBufferPercentage": 20,
        "maxCpuMultiplier": 8,
        "cpuManufacturers": ["intel", "amd", "amazon-web-services"]
    }'

Cleanup

Remove all resources:

cdk destroy

Conclusion

This automated solution adds flexibility to EC2 workloads by adapting to capacity constraints at start time. When the original instance type is unavailable, it selects a compatible alternative, which lets the workload draw on more of the EC2 instance portfolio and improves its chances of starting. Price-based selection keeps the cost impact low, and the original instance type is restored automatically on stop.

The solution is particularly useful for:

  • Development and test environments where capacity reservations aren't cost-effective
  • Workloads that can tolerate instance type variations

This approach complements traditional capacity planning strategies, providing an additional layer of resilience that helps ensure your workloads remain operational even during periods of capacity constraint.

AWS
EXPERT
published 5 days ago, 107 views