
re:Invent 2025 - Build safe and resilient deployment pipelines for Amazon ECS

6 minute read
Content level: Intermediate

Most teams running containerized workloads share two requirements: deploying new container versions without downtime, and rolling back quickly when something goes wrong. This session from re:Invent 2025 walks through the deployment strategies available in Amazon ECS, covering how each one works, how to configure it, and when to choose it over the alternatives.

Rolling out a new version of a containerized application carries real risk. A failed deployment can mean downtime, degraded customer experience, and a scramble to restore the previous state. Islam Mahgoub, Senior Solutions Architect at AWS and containers specialist, presented session CNS353 at re:Invent 2025 to address this challenge directly. In this post, we'll walk through the two main Amazon Elastic Container Service (Amazon ECS) deployment strategies, rolling updates and blue/green deployments, and explore how lifecycle hooks let you add custom logic to the deployment workflow.

Rolling Updates: Gradual Replacement with Configurable Bounds

Amazon ECS services are defined by a combination of a task definition and service configuration. When you update the task definition and point your service to it, ECS tracks this transition as a service deployment object, which records progress and holds references to both the source and target service revisions. This gives you a clear audit trail of what changed and when.

A rolling update deployment replaces tasks running the old version with tasks running the new version gradually, rather than all at once. Two parameters control this behavior: minimum healthy percent and maximum percent, both expressed as a percentage of the service's desired task count. Minimum healthy percent sets the floor on how many tasks must stay running during a deployment, protecting you from dropping below the capacity needed to serve traffic. Maximum percent sets the ceiling on the total number of tasks, old and new combined, which caps the extra capacity consumed during the transition. Together, these two settings let you balance availability, deployment speed, and cost.
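To make the two bounds concrete, here is a minimal sketch that converts the percentages into task counts. The function name is hypothetical, and the rounding (floor rounded up, ceiling rounded down) is an assumption about how ECS resolves fractional task counts that you should confirm against the service documentation.

```python
def rolling_update_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_total) task counts for a rolling update.

    Assumption: the floor rounds up and the ceiling rounds down.
    Integer arithmetic is used to avoid floating-point surprises.
    """
    min_running = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    max_total = desired_count * maximum_percent // 100                # floor division
    return min_running, max_total


print(rolling_update_bounds(4, 100, 200))  # (4, 8): full capacity kept, room to double
print(rolling_update_bounds(3, 50, 150))   # (2, 4)
```

With a desired count of 4, settings of 100 and 200 keep all four old tasks running while up to four new ones start alongside them.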

In practice, you can tune these parameters to fit your situation. If you can tolerate a brief reduction in capacity in exchange for a faster deployment, you lower the minimum healthy percent. If you want to bring new tasks up before removing old ones, you set the maximum percent high enough to allow both sets to coexist. The deployment proceeds by starting new tasks, waiting for them to become healthy, then terminating old ones, repeating this cycle until all tasks run the new version.
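The start, wait, terminate cycle described above can be sketched as a small simulation. This is a simplified model for building intuition, not the ECS scheduler's actual batching logic: it assumes every new task becomes healthy immediately, and `simulate_rolling_update` is a hypothetical helper that takes the bounds as task counts rather than percentages.

```python
def simulate_rolling_update(desired, min_running, max_total):
    """Return the (old, new) task counts after each step of a simplified rollout."""
    old, new = desired, 0
    steps = [(old, new)]
    while old > 0 or new < desired:
        # Start as many new tasks as the ceiling on total tasks allows.
        start = max(0, min(desired - new, max_total - (old + new)))
        new += start
        # Stop as many old tasks as the floor on running tasks allows
        # (new tasks are assumed healthy as soon as they start).
        stop = max(0, min(old, (old + new) - min_running))
        old -= stop
        if start == 0 and stop == 0:
            raise ValueError("bounds leave no room to make progress")
        steps.append((old, new))
    return steps


# Floor of 4, ceiling of 8: all new tasks start before any old task stops.
print(simulate_rolling_update(4, 4, 8))  # [(4, 0), (0, 4)]
# Floor of 2, ceiling of 4: capacity dips to 2 tasks, but no extra tasks are needed.
print(simulate_rolling_update(4, 2, 4))  # [(4, 0), (2, 0), (0, 2), (0, 4)]
```

The second run shows the trade-off in the text: lowering the floor removes the need for extra capacity at the price of a temporary drop in running tasks. A floor and ceiling that both equal the desired count leave the simulation nothing to start or stop, so it raises an error.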

The trade-off with rolling updates is rollback speed. If you discover a problem after the new version has been widely deployed, restoring the previous version requires re-provisioning tasks running the old container image. That takes time, which matters in a production incident.

Blue/Green Deployments: Two Environments, Instant Rollback

Blue/green deployment addresses the rollback latency problem by maintaining two parallel environments throughout the deployment lifecycle. The blue environment runs the current production version. The green environment runs the new version. Traffic shifts from blue to green, but blue stays alive for a configurable bake time window. If anything goes wrong during that window, a single rollback action reroutes traffic back to blue immediately, without re-provisioning tasks.

The deployment controller manages this traffic shifting by modifying Application Load Balancer (ALB) listener rules. This requires two target groups (one per environment) and an AWS Identity and Access Management (IAM) role that grants the deployment controller permission to update those listener rules. When you configure the service for blue/green, you provide the production listener, the test listener, and both target groups.
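As a rough illustration of the inputs listed above, the following sketch assembles request parameters for a blue/green service update as a plain dictionary. The field names (`strategy`, `bakeTimeInMinutes`, `advancedConfiguration`, `alternateTargetGroupArn`, `productionListenerRule`, `testListenerRule`, `roleArn`) reflect my understanding of the ECS UpdateService request shape and should be verified against the API reference. The helper name and all argument values are hypothetical.

```python
def blue_green_update_params(cluster, service, task_definition, *,
                             blue_target_group_arn, green_target_group_arn,
                             production_listener_rule_arn, test_listener_rule_arn,
                             load_balancer_role_arn, container_name, container_port,
                             bake_time_minutes):
    """Build (but do not send) parameters for a blue/green service update."""
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": task_definition,
        "deploymentConfiguration": {
            "strategy": "BLUE_GREEN",                # assumed enum value
            "bakeTimeInMinutes": bake_time_minutes,  # assumed field name
        },
        "loadBalancers": [{
            "targetGroupArn": blue_target_group_arn,
            "containerName": container_name,
            "containerPort": container_port,
            "advancedConfiguration": {               # assumed field names below
                "alternateTargetGroupArn": green_target_group_arn,
                "productionListenerRule": production_listener_rule_arn,
                "testListenerRule": test_listener_rule_arn,
                "roleArn": load_balancer_role_arn,   # role allowed to edit listener rules
            },
        }],
    }


# Placeholder values for illustration only.
params = blue_green_update_params(
    "demo-cluster", "demo-service", "demo-task:2",
    blue_target_group_arn="arn:placeholder:blue-tg",
    green_target_group_arn="arn:placeholder:green-tg",
    production_listener_rule_arn="arn:placeholder:prod-rule",
    test_listener_rule_arn="arn:placeholder:test-rule",
    load_balancer_role_arn="arn:placeholder:role",
    container_name="web", container_port=8080, bake_time_minutes=15,
)
```

If the field names hold, a dictionary like this could be passed to the boto3 ECS client as `update_service(**params)`. Building it separately keeps the configuration easy to review and unit test.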

One practical advantage of this setup is the separation between production and test traffic. The test listener routes to the green environment while production traffic still flows to blue. This gives you a live endpoint for validating the new version before it handles any production load. You can run automated checks, smoke tests, or manual verification against the test listener URL with confidence that your customers are unaffected.

The full lifecycle of a blue/green deployment moves through five stages: scale up (provision the green environment), test traffic shift (route test traffic to green), production traffic shift (route production traffic to green), bake time (both environments live, blue available for rollback), and cleanup (terminate the blue environment). The bake time duration is configurable, giving you a monitoring window under real production load before the old environment is decommissioned.
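The five stages can be modeled as a small state machine. This is a conceptual sketch only: the stage names are illustrative labels taken from the description above, not the identifiers used by the ECS API, and `run_deployment` is a hypothetical helper. It treats any failed check before cleanup as a rollback to blue.

```python
from enum import Enum


class Stage(Enum):
    SCALE_UP = "provision the green environment"
    TEST_TRAFFIC_SHIFT = "route test traffic to green"
    PRODUCTION_TRAFFIC_SHIFT = "route production traffic to green"
    BAKE_TIME = "both environments live, blue available for rollback"
    CLEANUP = "terminate the blue environment"


def run_deployment(stage_ok):
    """Walk the stages in order; stage_ok(stage) returns False to signal a problem.

    Returns (outcome, visited_stages). Cleanup is not checked because
    blue no longer exists to roll back to once it has been terminated.
    """
    visited = []
    for stage in Stage:  # Enum iteration follows definition order
        visited.append(stage)
        if stage is not Stage.CLEANUP and not stage_ok(stage):
            return "ROLLED_BACK", visited
    return "COMPLETED", visited


outcome, visited = run_deployment(lambda stage: stage is not Stage.BAKE_TIME)
print(outcome, [s.name for s in visited])
```

In the example, a problem detected during bake time ends the deployment before cleanup, so the blue environment is still there to take traffic back.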

Blue/green has a capacity cost. During a deployment you run two full sets of tasks simultaneously, which roughly doubles compute cost from the moment green scales up until blue is cleaned up. For most teams, the operational benefit of instant rollback justifies this temporary increase. For teams where cost is a primary constraint, rolling updates with carefully tuned minimum healthy percent and maximum percent settings may be the better default.
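A back-of-the-envelope estimate of that extra capacity can be sketched as follows. The function and its inputs are hypothetical, and it assumes the green environment runs at full size for the whole window, from scale up through the end of bake time.

```python
def extra_task_minutes(desired_count, deployment_minutes, bake_minutes):
    """Estimate the additional task-minutes consumed by running green alongside blue."""
    return desired_count * (deployment_minutes + bake_minutes)


# 10 tasks, 15 minutes to shift traffic, 30 minutes of bake time.
print(extra_task_minutes(10, 15, 30))  # 450
```

Multiplying the result by your per-task cost per minute gives a rough price for the rollback safety net, which makes the bake time setting easier to reason about.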

Extending Your Workflow with Lifecycle Hooks

Lifecycle hooks let you inject custom logic at specific points in the blue/green deployment workflow. Each hook is an AWS Lambda function that returns one of three states: in progress, success, or fail. When a hook returns in progress, the deployment pauses at that stage and polls the function again after a configured interval. A success response advances the deployment to the next stage. A fail response triggers a rollback.
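The three-state contract can be sketched from the deployment controller's point of view. This is a simulation for illustration, not ECS code: `drive_hook` is a hypothetical helper, the `hookStatus` key and the `IN_PROGRESS`, `SUCCEEDED`, and `FAILED` values are my understanding of the hook response format (the session names only the three states), and the poll limit is an assumption.

```python
def drive_hook(hook, max_polls=10):
    """Call hook() until it stops reporting IN_PROGRESS.

    Returns (action, polls_used), where action is "ADVANCE", "ROLLBACK",
    or "TIMED_OUT" (the last one is an assumption of this sketch).
    """
    for poll in range(1, max_polls + 1):
        status = hook()["hookStatus"]
        if status == "SUCCEEDED":
            return "ADVANCE", poll
        if status == "FAILED":
            return "ROLLBACK", poll
        if status != "IN_PROGRESS":
            raise ValueError(f"unexpected hook status: {status!r}")
        # A real controller would wait for the configured interval here.
    return "TIMED_OUT", max_polls


# A hook that needs two extra polls before it succeeds.
responses = iter([
    {"hookStatus": "IN_PROGRESS"},
    {"hookStatus": "IN_PROGRESS"},
    {"hookStatus": "SUCCEEDED"},
])
print(drive_hook(lambda: next(responses)))  # ('ADVANCE', 3)
```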

This mechanism supports a range of automation patterns. The session demonstrated a manual approval workflow built around the post-test-traffic-shift lifecycle stage. A Lambda function configured as the hook sets a deployment state of "pending" in Amazon S3, then sends a notification through Amazon Simple Notification Service (SNS) to a reviewer. Because the hook fires after the test listener has already been updated, the reviewer can access the new version at the test listener URL to verify it directly. Once satisfied, they update the state in S3 to "approved." The next hook invocation reads that state and returns success, advancing the deployment to the production traffic shift stage.
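The approval flow above might look roughly like the following. This is a sketch, not the code from the session demo: storage and notification are injected as plain callables so the logic runs anywhere, whereas a real Lambda function would back them with S3 reads and writes and an SNS publish. The `hookStatus` response shape is assumed, and the "rejected" state is an addition of this sketch that the session did not describe.

```python
def approval_hook(read_state, write_state, notify):
    """Decide the hook result from a stored approval state.

    read_state() returns the stored state or None; write_state(value) stores it;
    notify(message) alerts the reviewer. All three are hypothetical adapters.
    """
    state = read_state()
    if state is None:
        # First invocation: record the pending state and ask for a review.
        write_state("pending")
        notify("New version is live on the test listener; please review.")
        return {"hookStatus": "IN_PROGRESS"}
    if state == "approved":
        return {"hookStatus": "SUCCEEDED"}
    if state == "rejected":  # assumption: not part of the session's demo
        return {"hookStatus": "FAILED"}
    return {"hookStatus": "IN_PROGRESS"}  # still pending, keep polling


# In-memory stand-ins for the state object and the notification topic.
store, sent = {}, []
hook = lambda: approval_hook(lambda: store.get("state"),
                             lambda value: store.update(state=value),
                             sent.append)

print(hook())                # first call: pending, reviewer notified
store["state"] = "approved"  # the reviewer approves
print(hook())                # next poll: deployment advances
```

Keeping the decision logic separate from the AWS calls makes the hook easy to unit test before wiring it into a deployment.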

The same pattern applies to automated gates. You can configure hooks to run integration tests, call external monitoring APIs, or validate custom health metrics before production traffic shifts. If any check fails, the hook returns fail and the deployment rolls back automatically, without any human intervention.
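An automated gate can follow the same shape. In this minimal sketch, each check is a hypothetical zero-argument callable that returns True on success, a check that raises is treated as a failure, and the `hookStatus` response shape is again assumed rather than confirmed by the session.

```python
def gate_hook(checks):
    """Run named checks in order; fail the hook on the first check that does not pass.

    checks is an iterable of (name, callable) pairs.
    """
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception as error:  # a crashing check counts as a failed check
            print(f"check {name!r} raised {error!r}")
            passed = False
        if not passed:
            print(f"check {name!r} failed; requesting rollback")
            return {"hookStatus": "FAILED"}
    return {"hookStatus": "SUCCEEDED"}


# Stand-in checks; real ones might call the test listener or a monitoring API.
result = gate_hook([
    ("smoke test", lambda: True),
    ("error rate below threshold", lambda: 0.2 < 1.0),
])
print(result)
```

Stopping at the first failure keeps the rollback decision fast. A variant could run every check and log all failures if fuller diagnostics matter more than speed.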

Key Takeaways

Amazon ECS gives you a range of deployment strategies to match your resiliency requirements and cost constraints. Rolling updates are efficient in terms of capacity and work well as a default for most workloads, with the understanding that rollback requires re-provisioning. Blue/green deployments cost more during the deployment window because you are running two full environments, but they provide near-instant rollback and a built-in validation window before production traffic shifts.

Lifecycle hooks extend the blue/green workflow with custom automation, letting you encode your team's verification process directly into the pipeline. Whether that means automated integration tests, external health checks, or a manual approval step with a reviewer testing the new version on the test listener, hooks give you control over when each deployment stage advances without requiring external tooling to poll deployment status.

To see these patterns in action, including the demo built in this session, watch the full session recording: AWS re:Invent 2025 - Build safe and resilient deployment pipelines for Amazon ECS (CNS353).

AWS
EXPERT
published 2 months ago, 96 views