AWS re:Invent 2024 - Deployment best practices for reliable rollouts using Amazon ECS

6 minute read
Content level: Advanced

This blog post summarizes key highlights from the AWS re:Invent 2024 session presented by Vibhav Agarwal (Principal Product Manager) and Robert Northard (Specialist Solutions Architect, Containers). We'll explore best practices for achieving safer and faster deployments while maintaining visibility into the deployment process using Amazon ECS.

The session covered how to implement reliable deployment strategies on Amazon ECS and focused on three critical dimensions of successful deployments: safety, speed, and visibility.

Deployment Strategies Overview

The speakers discussed two main deployment strategies available in Amazon ECS:

  • Rolling Deployments: A strategy that gradually replaces the previous version of an application with a new version in increments until 100% complete. This is the most commonly used approach in ECS.
  • Blue/Green Deployments: A strategy where the new version (green) is deployed alongside the existing version (blue). Traffic is switched to the new version after testing and validation, allowing for quick rollback if needed.

The presentation primarily focused on rolling deployments, as they're the default and most widely used strategy in ECS.

Core Deployment Goals

The speakers emphasized three key aspects that teams should focus on:

  1. Safety: Ensuring deployments don't impact end users or service availability
  2. Speed: Minimizing the time taken to roll out changes
  3. Visibility: Maintaining clear insight into deployment progress and health

Making Deployments Safer

Robert Northard emphasized three fundamental aspects for achieving safer deployments in Amazon ECS:

  • Smart Health Check Implementation: The speakers highlighted the importance of choosing the right health check strategy. While it might be tempting to create comprehensive health checks that validate all system components, this often leads to deployment issues. Instead, they recommended implementing "soft" health checks that only validate whether your application is ready to receive traffic. This prevents situations where unrelated issues (like a temporary database slowdown) could block your deployment. For more detail, see the blog post "Advanced Techniques for Amazon ECS Container Health Checks".
  • Graceful Application Shutdown: One of the most overlooked aspects of deployment safety is how applications handle shutdown signals. During a deployment, ECS needs to stop existing tasks to replace them with new versions. Applications that don't handle these shutdowns properly can drop user requests or leave work incomplete. The speakers emphasized implementing proper shutdown handlers that allow your application to finish processing current requests before stopping.
  • High Availability Considerations: For production workloads, the speakers recommended configuring services so that they stay available throughout a deployment.

A key takeaway from this section was that deployment safety isn't just about preventing bad code from reaching production - it's about ensuring your application can handle the deployment process itself without impacting users.
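
As a rough illustration of the first two points, the sketch below pairs a shallow health endpoint with a SIGTERM handler, using only Python's standard library. The `/healthz` path, the default port, the `READY` flag, and the function names are assumptions made for this example and were not shown in the session. ECS sends SIGTERM to a container when it stops a task and follows with SIGKILL once the container's stop timeout elapses, so the handler's job is to stop accepting new connections and let in-flight requests finish.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical readiness flag: set it once startup work (config, caches) is done.
READY = threading.Event()


def route(path, ready):
    """Return (status, body) for a request path.

    The health route only reports whether this process can take traffic.
    It deliberately does not call a database or other downstream service.
    """
    if path == "/healthz":  # assumed path, not an ECS requirement
        return (200, b"ok") if ready else (503, b"starting")
    return 200, b"hello"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path, READY.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def build_server(port=8080):
    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    # Non-daemon handler threads are joined by server_close(),
    # so in-flight requests can finish before the process exits.
    server.daemon_threads = False
    return server


def install_sigterm_handler(server):
    """Stop accepting new requests when SIGTERM arrives."""

    def handle_sigterm(signum, frame):
        # shutdown() waits for serve_forever() to return, so it must not
        # run on the thread that is inside serve_forever().
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
```

A container entrypoint could wire this up roughly as `server = build_server()`, `install_sigterm_handler(server)`, `READY.set()`, `server.serve_forever()`, and finally `server.server_close()`, which waits for requests that are still being handled. The whole drain has to fit inside the container's stop timeout, otherwise the task is killed mid-request.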

Optimizing Deployment Speed

Vibhav Agarwal shared that deployment speed is often a challenge for teams, especially those running Java applications or working with large container images. He presented three key levers that teams can use to improve deployment speed:

  1. ECS Scheduling Speed: The simplest and most impactful change teams can make is adjusting their ECS service deployment configuration. The speakers recommended:
    • Setting maximum percent to 200% in most cases, which gives ECS the flexibility to launch new tasks before stopping old ones
    • Adjusting minimum healthy percent based on environment needs (100% for production, can be lower for development)
    • Understanding that these settings directly impact how quickly ECS can replace tasks during a deployment
  2. Task Launch Time: The speakers identified container image handling as the main bottleneck in task launches. Two new technologies were highlighted to address this:
    • SOCI (Seekable OCI): An open-source tool that enables lazy loading of container images. Instead of waiting for the entire image to download, containers can start with just the necessary layers.
    • Checkpoint Restore: Particularly valuable for Java applications, this technology allows you to snapshot an already-warmed-up application, significantly reducing startup times.
  3. Task Shutdown Optimization: The presenters emphasized that while it's tempting to reduce shutdown timeouts to speed up deployments, this needs to be balanced with safety. They recommended:
    • Keeping shutdown timeouts aligned with your application's needs
    • Optimizing load balancer draining configurations
    • Understanding that faster isn't always better - the goal is to be as fast as possible while maintaining reliability
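
To make the first lever concrete, here is a small sketch of the arithmetic behind minimum healthy percent and maximum percent. The rounding follows the ECS documentation as far as I understand it (the lower bound rounds up, the upper bound rounds down). The `launch_waves` function is a deliberately simplified model written for this post, not how the ECS scheduler is implemented: it assumes every new task becomes healthy immediately and ignores draining time.

```python
def task_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (lower, upper) limits on running tasks during a rolling deployment."""
    # Integer arithmetic avoids floating-point surprises.
    lower = -(-desired_count * minimum_healthy_percent // 100)  # round up
    upper = desired_count * maximum_percent // 100              # round down
    return lower, upper


def launch_waves(desired_count, minimum_healthy_percent, maximum_percent):
    """Count how many batches of new tasks are needed (simplified model)."""
    lower, upper = task_bounds(desired_count, minimum_healthy_percent, maximum_percent)
    old, new, waves = desired_count, 0, 0
    while old > 0 or new < desired_count:
        # Stop old tasks only while the total stays at or above the lower bound.
        stoppable = min(old, max(0, old + new - lower))
        old -= stoppable
        # Start new tasks only while the total stays at or below the upper bound.
        startable = min(desired_count - new, max(0, upper - (old + new)))
        new += startable
        if stoppable == 0 and startable == 0:
            raise ValueError("deployment cannot make progress with these settings")
        if startable > 0:
            waves += 1
    return waves
```

With a desired count of 4, the recommended 100% / 200% setting lets all four replacements start at once (`launch_waves(4, 100, 200)` is 1), while 100% / 150% needs two batches. Setting both values to 100% leaves no room to start or stop anything, which the sketch reports as an error.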

Monitoring and Deployment Visibility

A significant portion of the talk focused on new features and best practices for monitoring deployments. Vibhav highlighted that even with perfect configuration, deployments can still fail, making visibility crucial.

  • New Service Deployment Features: Amazon ECS recently introduced Service Revisions and Service Deployments, providing immutable snapshots of service configurations and deployment processes. These features store 90 days of deployment history, allowing teams to track changes, success rates, and debug issues when needed.
  • Deployment Protection: The speakers emphasized two key protection mechanisms. The first is the deployment circuit breaker, which they described as the most important feature teams should enable. It automatically monitors task health during deployments and can trigger rollbacks when needed, and it was recently improved to be more responsive for services with fewer than 10 tasks. The second is CloudWatch alarms integration, which monitors application behavior beyond basic task health. Teams can configure alarms on business metrics like HTTP errors, API latency, or queue depth, and the deployment can be rolled back automatically if these metrics deteriorate.
  • Measuring Deployment Success: The speakers recommended using Container Insights to track deployment frequency and duration, helping teams understand their deployment velocity and identify areas for improvement.
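
For readers who want to see what enabling both protection mechanisms looks like, the sketch below builds the `deploymentConfiguration` structure that the ECS `UpdateService` API accepts. The helper name and the alarm name in the usage note are hypothetical; the keys mirror the API's field names, but check the API reference for your SDK version before relying on them.

```python
def build_deployment_configuration(alarm_names=(), minimum_healthy_percent=100,
                                   maximum_percent=200):
    """Build a deploymentConfiguration dict with rollback protections enabled."""
    config = {
        "minimumHealthyPercent": minimum_healthy_percent,
        "maximumPercent": maximum_percent,
        # Circuit breaker: stop a failing deployment and roll back
        # to the last successfully completed deployment.
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
    }
    if alarm_names:
        # CloudWatch alarms: roll back if any listed alarm goes into ALARM
        # while the deployment is in progress.
        config["alarms"] = {
            "alarmNames": list(alarm_names),
            "enable": True,
            "rollback": True,
        }
    return config
```

With boto3 installed, the result could be passed as `boto3.client("ecs").update_service(cluster=..., service=..., deploymentConfiguration=build_deployment_configuration(["orders-api-5xx"]))`, where `orders-api-5xx` stands in for an alarm you have already created in CloudWatch.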

Conclusion

The session demonstrated that achieving reliable deployments with Amazon ECS requires a balanced approach between safety, speed, and visibility. The speakers emphasized the importance of proper health checks, deployment configurations, and new features like SOCI and Checkpoint Restore. By enabling Circuit Breaker, implementing proper shutdown handling, and using the new service deployment features, teams can build more reliable deployment processes incrementally, focusing first on the basics before moving to optimizations.

For those interested in diving deeper into these concepts, the full session recording is available on the AWS YouTube channel, and additional resources including workshops and documentation can be found at the session resource page.
