re:Invent 2025 - Deep Dive: ECS Managed Instances & Blue/Green for Resilient Services

8 minute read
Content level: Advanced

Amazon ECS shipped two significant capabilities in late 2025: Amazon ECS Managed Instances, a new compute option between EC2 and Fargate on the control-versus-simplicity spectrum, and redesigned native deployment strategies (blue/green, linear, and canary) that remove the dependency on AWS CodeDeploy. This post covers both, with a focus on how they reduce operational overhead for teams running containerized workloads.

Running containerized workloads on AWS has long presented teams with a frustrating tradeoff: accept the operational overhead of managing EC2 infrastructure yourself, or give up instance flexibility by going fully serverless with Fargate. On the deployment side, rolling out new application versions safely required an external dependency on CodeDeploy, adding complexity to an already nuanced process. At re:Invent 2025, Malcolm Featonby, Senior Principal Engineer at AWS, and Maish Saidel-Keesing, Senior Developer Advocate at AWS, presented two major additions to Amazon ECS that address both problems directly. In this post, we'll explore how Amazon ECS Managed Instances closes the gap between EC2 flexibility and Fargate simplicity, and how the redesigned native deployment capabilities let your team ship new versions with validation gates and automatic rollback instead of guesswork.

Bridging the Gap: Amazon ECS Managed Instances

Amazon ECS has always organized its compute offerings through a concept called capacity providers, each representing a different point on the shared responsibility spectrum. On one end, ECS on Amazon EC2 gives you full control over instance types, placement, scaling, and patching, but that control comes with the full weight of managing it. On the other end, AWS Fargate handles the underlying compute entirely, letting you focus exclusively on your application. The tradeoff is that Fargate imposes upper limits on vCPU and memory per task, and does not support accelerated compute types like GPUs.

Amazon ECS Managed Instances, launched in September 2025, is a new capacity provider type designed to sit between these two options. You still get EC2 instances in your own account, visible when you call DescribeInstances or browse the EC2 console. But ECS takes responsibility for provisioning, scaling, patching, and placement. You provide two IAM roles: an instance role that allows the ECS agent running on each instance to communicate with the ECS control plane, and an infrastructure management role that grants ECS permission to manage EC2 resources on your behalf. Beyond that, you specify the subnets you want used, and the service takes it from there.

Instance selection is handled through attribute-based selection, the same mechanism available in EC2 Auto Scaling groups and fleet configurations. This lets you describe the kind of compute you need rather than naming specific instance types. By default, ECS Managed Instances chooses from the C, M, and R instance families. If your workload needs GPU compute, specific network-optimized instances, or a particular family, you can be as prescriptive as you want. The guidance from the engineering team is to be as general as possible, since a broader specification gives ECS more capacity options, which translates to faster task launches and better availability.
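To make the "describe, don't name" idea concrete, here is a minimal sketch of attribute-based filtering over a small catalog. The catalog entries, the `InstanceType` class, and the `select_candidates` function are all hypothetical and exist only for illustration; they are not the ECS or EC2 API. The only detail taken from the session is the default of the C, M, and R families.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str          # hypothetical name, not a real EC2 instance type
    family: str        # single-letter family, e.g. "c", "m", "r", "g"
    vcpus: int
    memory_gib: int
    gpus: int = 0


# Hypothetical catalog for illustration only.
CATALOG = [
    InstanceType("c.large", "c", 2, 4),
    InstanceType("c.xlarge", "c", 4, 8),
    InstanceType("m.large", "m", 2, 8),
    InstanceType("m.xlarge", "m", 4, 16),
    InstanceType("r.large", "r", 2, 16),
    InstanceType("g.xlarge", "g", 4, 16, gpus=1),
]


def select_candidates(
    catalog: Iterable[InstanceType],
    min_vcpus: int,
    min_memory_gib: int,
    min_gpus: int = 0,
    families: Sequence[str] = ("c", "m", "r"),  # default families named in the session
) -> List[str]:
    """Return the names of every instance type that satisfies the attributes."""
    return [
        it.name
        for it in catalog
        if it.family in families
        and it.vcpus >= min_vcpus
        and it.memory_gib >= min_memory_gib
        and it.gpus >= min_gpus
    ]


broad = select_candidates(CATALOG, min_vcpus=2, min_memory_gib=8)
narrow = select_candidates(CATALOG, min_vcpus=2, min_memory_gib=8, families=("c",))
gpu = select_candidates(CATALOG, min_vcpus=2, min_memory_gib=8, min_gpus=1, families=("g",))
```

In this toy catalog the broad request matches four instance types while the family-restricted request matches one, which is the intuition behind the advice to stay general: more candidates means more places to find capacity.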

Sizing decisions are driven by your task definition configuration, not by a separately specified instance type. ECS reads the memory reservations, memory limits, and CPU allocations defined at both the container and task level, then selects an EC2 instance that fits. This approach also enables a meaningful advantage over Fargate for bursty workloads: because multiple tasks from the same service share the underlying instance, a container that spikes CPU at startup (common in Java applications doing just-in-time compilation) can burst into unused capacity on the host rather than being capped at a hard Fargate allocation.
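As a rough illustration of how a resource requirement could be derived from a task definition, the sketch below prefers task-level values and falls back to summing container-level values. The field names (`cpu`, `memory`, `memoryReservation`, `containerDefinitions`) follow the ECS task definition format, but the fallback rules here are an assumption made for this example, not the service's documented sizing logic, and values are treated as plain integers (CPU units and MiB) for simplicity.

```python
from typing import Tuple


def task_requirements(task_def: dict) -> Tuple[int, int]:
    """Return (cpu_units, memory_mib) for a task definition dict.

    Assumed rule for this sketch: use task-level values when present,
    otherwise sum the container-level values. For container memory,
    use the hard limit ("memory") if set, else the soft reservation.
    """
    containers = task_def.get("containerDefinitions", [])

    cpu = task_def.get("cpu")
    if cpu is None:
        cpu = sum(c.get("cpu", 0) for c in containers)

    memory = task_def.get("memory")
    if memory is None:
        memory = sum(
            c.get("memory", c.get("memoryReservation", 0)) for c in containers
        )

    return int(cpu), int(memory)


task_level = {"cpu": 1024, "memory": 2048, "containerDefinitions": [{"cpu": 256}]}
container_level = {
    "containerDefinitions": [
        {"name": "app", "cpu": 256, "memory": 512},
        {"name": "sidecar", "cpu": 128, "memoryReservation": 256},
    ]
}

task_level_req = task_requirements(task_level)
container_level_req = task_requirements(container_level)
```

An instance "fits" when its available CPU and memory are at least these two numbers; any capacity left over on the host is what a bursting container can borrow.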

Placement follows a spread-first strategy by default. When tasks need to be scheduled, the scheduler first places them on instances already in the cluster that meet placement constraints. Tasks that cannot fit on existing capacity trigger a new EC2 instance request. When choosing instance size, ECS uses a first-fit-decreasing algorithm, targeting the largest instance that satisfies the workload's requirements. The goal is to reduce image pull time by reusing the image cache across tasks landing on the same host, while also keeping costs down through bin packing.
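The "existing capacity first, then new instances" flow combined with first-fit-decreasing can be sketched as a small bin-packing routine. This is a simplified illustration under stated assumptions, not the ECS scheduler: the `Instance` class and `place_tasks` function are hypothetical, new instances all have one fixed size, tasks are ordered by memory and then CPU, and placement constraints and spreading are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Instance:
    cpu_free: int
    mem_free: int
    tasks: List[str] = field(default_factory=list)

    def try_place(self, name: str, cpu: int, mem: int) -> bool:
        """Place the task here if it fits; report whether it was placed."""
        if cpu <= self.cpu_free and mem <= self.mem_free:
            self.cpu_free -= cpu
            self.mem_free -= mem
            self.tasks.append(name)
            return True
        return False


def place_tasks(
    tasks: List[Tuple[str, int, int]],  # (name, cpu_units, memory_mib)
    instances: List[Instance],          # existing capacity, mutated in place
    new_cpu: int,                       # assumed size of any newly launched instance
    new_mem: int,
) -> int:
    """First-fit-decreasing placement. Returns how many instances were launched."""
    launched = 0
    # "Decreasing": consider the largest tasks first (memory, then CPU).
    ordered = sorted(tasks, key=lambda t: (t[2], t[1]), reverse=True)
    for name, cpu, mem in ordered:
        # "First fit": any() stops at the first instance that accepts the task.
        if any(inst.try_place(name, cpu, mem) for inst in instances):
            continue
        if cpu > new_cpu or mem > new_mem:
            raise ValueError(f"task {name} does not fit on a new instance")
        fresh = Instance(new_cpu, new_mem)
        fresh.try_place(name, cpu, mem)
        instances.append(fresh)
        launched += 1
    return launched


cluster = [Instance(cpu_free=1024, mem_free=2048)]
pending = [("a", 512, 1024), ("b", 512, 1024), ("c", 512, 1024)]
launched = place_tasks(pending, cluster, new_cpu=2048, new_mem=4096)
```

Here two tasks land on the existing instance and only the third triggers a launch, so the new host starts with room for later tasks that can reuse its image cache.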

Patching is handled on a 30-day cycle without any action on your part. ECS Managed Instances runs on Bottlerocket, AWS's container-optimized operating system, so you cannot bring a custom AMI. On day zero, each instance is provisioned with the latest Bottlerocket image. Starting at day 14, ECS begins looking for opportunities to replace instances, respecting the EC2 maintenance windows you have configured. By day 21, if instances still have not been replaced, ECS becomes more aggressive and will drain and replace them to meet the 30-day compliance window. Instances are always replaced with fresh ones rather than patched in place, because new instances benefit from EC2's continuous health checks at the fleet level.
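The timeline above maps naturally onto a function of instance age. The sketch below encodes the day 14, day 21, and day 30 thresholds as described in the session; the phase names and the `patch_phase` function are labels invented for this example, not terms from the service.

```python
def patch_phase(age_days: int) -> str:
    """Classify an instance by age, using the thresholds described above.

    Phase names are hypothetical labels for this sketch.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 14:
        return "fresh"          # recently provisioned, nothing to do
    if age_days < 21:
        return "opportunistic"  # replace when a convenient moment arises
    if age_days < 30:
        return "aggressive"     # drain and replace to meet the 30-day window
    return "overdue"            # should not occur if replacement succeeded


phases = [patch_phase(d) for d in (0, 13, 14, 20, 21, 29, 30)]
```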

The service also manages idle compute continuously. Any time a task stops, ECS sweeps the cluster looking for consolidation opportunities. It will move tasks to pack them more efficiently and deprovision instances that are no longer needed. For diurnal workloads that scale down overnight, ECS will also downscale to smaller instance types, replacing a large instance with a smaller one rather than leaving it running at low utilization.

The ECS team's guidance on when to use each option is simple. Fargate remains the right starting point for most stateless workloads that fit within its vCPU and memory limits. ECS Managed Instances fits best when you need larger task sizes, accelerated compute, or control over the instance family, without taking on the operational responsibility that comes with ECS on EC2. For teams already running on ECS on EC2 primarily because Fargate did not offer the instance flexibility they needed, the presenters describe the migration path to ECS Managed Instances as straightforward.

Native Deployments: Blue/Green, Linear, and Canary

In July 2025, Amazon ECS redesigned its deployment capabilities to remove the dependency on CodeDeploy and bring deployment management natively into the service. Three deployment strategies are now available: blue/green, linear, and canary. All three share the same six-phase structure: preparation, deployment, testing, traffic shift, monitoring (bake time), and cleanup.

The preparation phase creates the necessary routing rules in the load balancer and target groups without provisioning any new tasks. The deployment phase then scales up the new (green) version to 100% of the required task count before any traffic moves. ECS always starts new tasks before stopping old ones; this is described by the team as a core operational rule. Once the green service is running at full capacity and passing health checks, test traffic is injected to verify that the new version can accept and respond to requests correctly.

At each phase boundary, you can attach a deployment lifecycle hook. A lifecycle hook is an AWS Lambda function that you define with custom validation logic. The function returns one of three states: passed, in progress, or failed. A failure at any hook automatically triggers a rollback to the previous version. This makes it practical to check things like container image digest matches, target group registration status, or any other condition your team considers a deployment gate.
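A lifecycle hook can be sketched as a small Lambda handler that maps a validation outcome to one of the three states. This is a minimal sketch under assumptions: it assumes the response is a JSON object with a `hookStatus` field whose values are `SUCCEEDED`, `IN_PROGRESS`, and `FAILED` (the "passed, in progress, failed" states above), which should be confirmed against the current ECS documentation. The `run_validation` function and the extra `check` parameter are hypothetical placeholders for your own gate logic.

```python
from typing import Any, Callable, Dict, Optional


def run_validation(event: Dict[str, Any]) -> Optional[bool]:
    """Hypothetical placeholder for real checks (image digest, target health, ...).

    Returns True to pass, False to fail, None if the check is still running.
    """
    return True


def lambda_handler(
    event: Dict[str, Any],
    context: Any,
    check: Callable[[Dict[str, Any]], Optional[bool]] = run_validation,
) -> Dict[str, str]:
    """Translate a validation outcome into a hook response.

    The "hookStatus" field name and its values are assumptions; verify them
    against the ECS lifecycle hook documentation before relying on them.
    """
    try:
        outcome = check(event)
    except Exception:
        # Treat an unexpected error as a failed gate so the deployment rolls back.
        return {"hookStatus": "FAILED"}
    if outcome is None:
        return {"hookStatus": "IN_PROGRESS"}
    return {"hookStatus": "SUCCEEDED" if outcome else "FAILED"}


def _boom(event: Dict[str, Any]) -> Optional[bool]:
    raise RuntimeError("validation crashed")


passed = lambda_handler({}, None)
pending = lambda_handler({}, None, check=lambda e: None)
failed = lambda_handler({}, None, check=lambda e: False)
crashed = lambda_handler({}, None, check=_boom)
```

Failing closed on exceptions is a deliberate choice in this sketch: a hook that cannot complete its check is treated the same as one that found a problem.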

The three strategies differ in how they handle the traffic shift phase. Blue/green shifts 100% of traffic to the green version in a single step, making it the fastest option. It works best when your deployment pipeline is mature and you are confident in your automated tests, since all customers move to the new version simultaneously. Rollback is fast because the blue version remains live with zero traffic until the cleanup phase completes. Linear deployments move traffic in equal increments, with a configurable bake time between each step. This is the right choice when you want to observe the new version's behavior under a growing percentage of real traffic before committing fully, though it comes with a longer deployment window and the cost of running both versions at full capacity throughout. Canary deployments split the difference: you shift a small percentage of traffic first (10% is the example given), observe behavior during a bake period, and then move the remaining traffic in a single step.
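The difference between the three strategies comes down to the sequence of cumulative traffic percentages sent to the green version. The sketch below generates that sequence; the `traffic_schedule` function, its strategy names, and its parameters are hypothetical and do not correspond to ECS configuration fields.

```python
from typing import List


def traffic_schedule(
    strategy: str,
    step_percent: int = 10,    # linear: size of each increment
    canary_percent: int = 10,  # canary: size of the first shift
) -> List[int]:
    """Return the cumulative percentage of traffic on green after each shift."""
    if strategy == "blue_green":
        return [100]  # everything moves in a single step
    if strategy == "canary":
        if not 0 < canary_percent < 100:
            raise ValueError("canary_percent must be between 1 and 99")
        return [canary_percent, 100]  # small slice first, then the remainder
    if strategy == "linear":
        if not 0 < step_percent <= 100:
            raise ValueError("step_percent must be between 1 and 100")
        steps = []
        shifted = 0
        while shifted < 100:
            # Clamp the final step so the total never exceeds 100%.
            shifted = min(100, shifted + step_percent)
            steps.append(shifted)
        return steps
    raise ValueError(f"unknown strategy: {strategy}")


blue_green = traffic_schedule("blue_green")
canary = traffic_schedule("canary", canary_percent=10)
linear = traffic_schedule("linear", step_percent=25)
```

A bake period would sit between consecutive entries of the list, so a longer list means a longer deployment window with both versions running.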

The choice between strategies is not only about team confidence; it also depends on the application itself. For workloads where user experience would differ based on which version processes the request (a common concern with LLM-based applications where model behavior changes between versions), blue/green provides consistency by avoiding the mixed-version state that linear and canary create during the traffic shift.

What This Means for Your Services

Amazon ECS Managed Instances and the native deployment capabilities represent a meaningful shift in how much operational responsibility the service takes on by default. ECS Managed Instances gives EC2-level flexibility without requiring you to own the infrastructure lifecycle. The deployment redesign removes CodeDeploy as an external dependency and brings lifecycle hooks, bake time, and rollback directly into ECS service configuration. Both features are built around practices the ECS team applies to Amazon's own internal fleets, including spread placement by default, the start-before-stop rule, and continuous rebalancing as a default service behavior.

If you are running on ECS on EC2 and your primary reason is instance type control, ECS Managed Instances is worth evaluating. If you are building or refining your deployment process, the native blue/green, linear, and canary options give you a structured path with built-in observability points rather than requiring a separate deployment toolchain.

Watch the full session: AWS re:Invent 2025 - Deep Dive: ECS Managed Instances & Blue/Green for Resilient Services (CNS416)
