re:Invent 2025 - Building event-driven architectures using Amazon ECS with AWS Fargate
This post covers session CNS307 from AWS re:Invent 2025. If your containerized microservices have ever buckled under peak traffic, this session offers a practical path to more resilient, scalable architectures using tools you may already have in place.
Building event-driven architectures with containers is not the first thing most teams consider, yet it is one of the most effective ways to eliminate tight coupling and scale through unpredictable demand. Eric Johnson, Principal Developer Advocate at AWS, and Matthew Meckes, Senior Containers Specialist at AWS, used this session to walk through the full journey: from a brittle synchronous architecture to a resilient event-driven system running on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. In this post, we'll explore the core problem with synchronous microservices, the event broker options available to you, and five concrete patterns you can apply immediately.
The problem with synchronous microservices
The session opens with Sarah, a fictional but familiar developer who watches her e-commerce platform collapse on Black Friday. Her architecture is built with modern microservices, deployed on Amazon ECS, and it still fails. The reason is not the compute choice. It is tight coupling. When the order service calls the loyalty service synchronously and the loyalty service slows down, the entire flow stalls. Every upstream caller waits, queues back up, and the system fails under load.
This is the core argument for event-driven architecture (EDA). EDA inserts an intermediary broker between producers and consumers, so the producer emits an event and moves on. Consumers react when ready. The producer no longer needs to know which services are downstream or how many of them exist. This decoupling allows you to scale each service independently, degrade gracefully when a dependency is slow, and add new consumers without touching the producer.
Johnson and Meckes are deliberate about one important point: the right response to an outage is not to throw away your compute layer and rewrite everything. If you are changing your architectural pattern, keep your compute stable. If you are changing your compute, keep your architectural pattern stable. An incremental migration, week by week toward EDA, puts you in a better position before the next big traffic spike than a big-bang rewrite would.
Choosing your event brokers
The session categorizes event brokers into three types, each suited to different situations.
Routers handle one event at a time and deliver it to multiple targets based on rules you define at the router. Amazon EventBridge is the primary example. It is serverless, natively integrates with most AWS services, and lets producers emit events without any knowledge of what consumes them. You write a rule in JSON, define your target (including an ECS task or a private API endpoint inside your VPC using PrivateLink and VPC Lattice), and EventBridge handles retry, dead-letter queues (DLQs), and throttling on your behalf. This keeps your producer and consumer code simple.
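As a rough illustration of what a rule "written in JSON" looks like, the sketch below defines an event pattern and a deliberately simplified local matcher. The source name `com.example.orders`, the `status` field, and the `matches` helper are assumptions made for this example; the real EventBridge matching engine supports many more operators than the exact-match subset shown here.

```python
import json

# Hypothetical event pattern: match order.created events from an assumed
# "com.example.orders" source. Field names are illustrative only.
ORDER_CREATED_PATTERN = {
    "source": ["com.example.orders"],
    "detail-type": ["order.created"],
    "detail": {"status": ["CONFIRMED"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher covering exact-value matching only.

    Every key in the pattern must exist in the event. A nested dict is
    matched recursively; a list means "the event value must equal one of
    these alternatives". Extra fields in the event are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


event = {
    "source": "com.example.orders",
    "detail-type": "order.created",
    "detail": {"orderId": "o-123", "status": "CONFIRMED"},
}
other_event = {**event, "source": "com.example.payments"}

print(json.dumps(ORDER_CREATED_PATTERN, indent=2))
print(matches(ORDER_CREATED_PATTERN, event))
print(matches(ORDER_CREATED_PATTERN, other_event))
```

The producer never sees this pattern. It lives with the rule, which is what lets you add or change consumers without touching producer code.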
Topics, represented by Amazon Simple Notification Service (Amazon SNS), broadcast to multiple subscribers simultaneously and support filtering at the topic level. They are designed for fan-out at high throughput, with support for nearly unlimited messages per second and up to 12.5 million subscriptions per topic. First-in, first-out (FIFO) topics are available when ordering matters.
Queues, specifically Amazon Simple Queue Service (Amazon SQS), are a one-to-one pattern: each message is processed by a single consumer. Consumers poll the queue and control their own processing rate. Messages are made invisible during processing and reappear if the consumer does not confirm completion, giving you built-in retry without extra code. This is the right tool when you need to meter throughput or protect a downstream service from being overwhelmed.
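The visibility behavior described above can be modeled in a few lines. This is an in-memory toy, not the SQS API: the `ToyQueue` class, its method names, and the explicit `now` clock argument are all assumptions chosen to keep the example deterministic.

```python
import itertools


class ToyQueue:
    """In-memory model of visibility-timeout semantics; not the SQS API."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # receipt id -> [body, time at which it is visible]
        self._ids = itertools.count(1)

    def send(self, body: str) -> None:
        self._messages[next(self._ids)] = [body, 0.0]

    def receive(self, now: float):
        """Return (receipt, body) for the first visible message and hide it
        until now + visibility_timeout. Return None if nothing is visible."""
        for receipt, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return receipt, entry[0]
        return None

    def delete(self, receipt: int) -> None:
        """The consumer confirms completion, so the message is removed."""
        self._messages.pop(receipt, None)


queue = ToyQueue(visibility_timeout=30.0)
queue.send("order-1")

first = queue.receive(now=0.0)    # consumer A takes the message
hidden = queue.receive(now=10.0)  # still invisible, so nothing is returned
retry = queue.receive(now=31.0)   # A never confirmed, so the message reappears
queue.delete(retry[0])            # consumer B confirms completion
empty = queue.receive(now=100.0)  # nothing left to process
```

One simplification worth noting: this toy reuses the same receipt id on redelivery, whereas real SQS issues a fresh receipt handle on every receive.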
For workloads already on Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides a fully managed version of Kafka. Amazon Kinesis covers the same streaming use case for AWS-native implementations. In streams, all consumers receive all messages, so consumer-side filtering and ordering management are your responsibility.
Five patterns for ECS and EDA
Patterns 1 and 2: Public and private API integration. If your consumers expose an HTTP API, EventBridge API destinations let you push events securely to that endpoint. For publicly reachable services, IAM-authenticated calls from EventBridge to your ECS container's API work well and remove the need for a Lambda function as an intermediary. For private endpoints inside a VPC, EventBridge supports PrivateLink connections, giving you the same decoupling with full network isolation.
Pattern 3: Queue-based polling with autoscaling. Long-running ECS services poll SQS for messages and process them in batches. This model keeps your containers running continuously and lets you tune throughput directly in your polling code. The interesting part is how you scale. For predictable workloads, step scaling tied to the SQS queue depth (for example, add one task at 5 messages and two tasks at 15) works well. For workloads where message processing time varies widely, custom metric math using CloudWatch lets you combine queue depth and in-flight task count into a meaningful backlog-per-task metric. This prevents your autoscaling from reacting erratically while still responding accurately to real demand.
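The arithmetic behind both scaling approaches is small enough to sketch. The step thresholds come from the example above; the function names, the target latency, and the per-message processing time are assumptions for illustration, and in practice the inputs would come from CloudWatch metrics such as the queue's visible-message count and the service's running task count.

```python
import math


def step_adjustment(queue_depth: int) -> int:
    """Step scaling from the example in the text: add one task once the
    queue holds 5 messages, and two tasks once it holds 15."""
    if queue_depth >= 15:
        return 2
    if queue_depth >= 5:
        return 1
    return 0


def backlog_per_task(visible_messages: int, running_tasks: int) -> float:
    """Queue depth divided by task count, guarding against zero tasks."""
    return visible_messages / max(running_tasks, 1)


def desired_tasks(
    visible_messages: int,
    seconds_per_message: float,
    target_latency_seconds: float,
    max_tasks: int,
) -> int:
    """Size the service so each task's backlog can drain within the
    target latency. All three tuning inputs are assumed values."""
    acceptable_backlog = target_latency_seconds / seconds_per_message
    needed = math.ceil(visible_messages / acceptable_backlog)
    return max(1, min(needed, max_tasks))


# 1,500 queued messages, 0.5 s each, and a 60 s latency target mean one task
# can carry a backlog of 120 messages, so 1500 / 120 = 12.5 rounds up to 13.
print(backlog_per_task(1500, 10))
print(desired_tasks(1500, 0.5, 60, max_tasks=20))
```

Scaling on backlog per task rather than raw queue depth means a deep queue served by many tasks does not trigger further scale-out on its own.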
Pattern 4: EventBridge RunTask for event-based containers. Rather than running a persistent service, EventBridge can call the Amazon ECS RunTask API directly. One event triggers one task. This is a fire-and-forget model appropriate for work that is too long for AWS Lambda but episodic enough that a standing service would be wasteful. A good example from the session is video processing: short videos go to Lambda, longer ones spin up a dedicated ECS task. EventBridge rules can encode that logic with no application code changes.
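A sketch of how the video-routing rules might be expressed follows. The 600-second threshold, the `com.example.media` source, the `durationSeconds` field, and every ARN and subnet ID are placeholders invented for this example. The dictionaries follow the shape of EventBridge event patterns and ECS rule targets, but nothing here calls AWS; the `numeric_matches` helper only evaluates the numeric condition locally to show which branch an event would take.

```python
import operator

LONG_VIDEO_THRESHOLD_SECONDS = 600  # assumed cut-off, not from the session

# Pattern for the rule that targets ECS RunTask (longer videos).
long_video_pattern = {
    "source": ["com.example.media"],
    "detail-type": ["video.uploaded"],
    "detail": {
        "durationSeconds": [{"numeric": [">", LONG_VIDEO_THRESHOLD_SECONDS]}]
    },
}

# Pattern for the rule that targets Lambda (shorter videos).
short_video_pattern = {
    "source": ["com.example.media"],
    "detail-type": ["video.uploaded"],
    "detail": {
        "durationSeconds": [{"numeric": ["<=", LONG_VIDEO_THRESHOLD_SECONDS]}]
    },
}

# Target definition for the long-video rule; all identifiers are placeholders.
ecs_target = {
    "Id": "long-video-task",
    "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/media",
    "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-run-task",
    "EcsParameters": {
        "TaskDefinitionArn": (
            "arn:aws:ecs:us-east-1:123456789012:task-definition/transcode"
        ),
        "TaskCount": 1,  # one event triggers one task
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
            "awsvpcConfiguration": {
                "Subnets": ["subnet-0123456789abcdef0"],
                "AssignPublicIp": "DISABLED",
            }
        },
    },
}

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def numeric_matches(condition: list, value: float) -> bool:
    """Evaluate a numeric condition such as [">", 600] or [">", 0, "<=", 600]
    locally. Operators and bounds alternate, and all pairs must hold."""
    pairs = zip(condition[0::2], condition[1::2])
    return all(_OPS[op](value, bound) for op, bound in pairs)


def route(duration_seconds: float) -> str:
    """Report which of the two rules above a video of this length matches."""
    condition = long_video_pattern["detail"]["durationSeconds"][0]["numeric"]
    return "ecs-task" if numeric_matches(condition, duration_seconds) else "lambda"


print(route(45), route(3600))
```

Because the split lives in the two rule patterns, moving the threshold is a rule change rather than an application deployment.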
Pattern 5: AWS Step Functions with ECS. AWS Step Functions adds orchestration on top of the RunTask pattern. You can run tasks asynchronously (fire and forget), synchronously using the .sync integration (Step Functions polls ECS task status and waits for completion), or with the callback token pattern (the ECS task holds a token, does its work, and calls the Step Functions SendTaskSuccess or SendTaskFailure API when done, returning result data to the workflow). This last option is particularly useful when downstream steps depend on output from the container. For high-throughput batch scenarios, Step Functions Activities provide a managed task queue that any number of ECS workers can poll, serving multiple workflows simultaneously. One team cited in the session used this pattern to reduce their Step Functions cost from $450 per invocation to $1 by shifting fine-grained processing into ECS workers instead of individual workflow state transitions.
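The callback token pattern has two halves: a state definition that hands the token to the container, and worker code that reports back. The sketch below shows both under stated assumptions: the cluster and task definition ARNs, the container name `worker`, the `TASK_TOKEN` environment variable, and the timeout are placeholders, and the state omits the network configuration a real Fargate task would need. `RecordingClient` is a stand-in so the example runs without AWS; in a real worker you would pass a boto3 Step Functions client instead.

```python
import json

# Task state that starts an ECS task and pauses until the token is returned.
# "$$.Task.Token" is read from the Step Functions context object.
process_order_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::ecs:runTask.waitForTaskToken",
    "Parameters": {
        "LaunchType": "FARGATE",
        "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/workers",
        "TaskDefinition": (
            "arn:aws:ecs:us-east-1:123456789012:task-definition/worker"
        ),
        "Overrides": {
            "ContainerOverrides": [
                {
                    "Name": "worker",
                    "Environment": [
                        {"Name": "TASK_TOKEN", "Value.$": "$$.Task.Token"}
                    ],
                }
            ]
        },
    },
    "TimeoutSeconds": 3600,  # assumed upper bound so a lost task cannot hang
    "End": True,
}


def run_worker(sfn_client, task_token: str, work) -> None:
    """Run the work, then return the result or the failure to the workflow."""
    try:
        result = work()
    except Exception as exc:
        sfn_client.send_task_failure(
            taskToken=task_token, error=type(exc).__name__, cause=str(exc)
        )
        return
    sfn_client.send_task_success(taskToken=task_token, output=json.dumps(result))


class RecordingClient:
    """Stand-in for a Step Functions client; records calls instead of sending."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("failure", kwargs))


def failing_work():
    raise ValueError("bad input")


client = RecordingClient()
run_worker(client, "token-abc", lambda: {"tier": "gold"})
run_worker(client, "token-def", failing_work)
print(client.calls)
```

Inside the container, the token would typically be read from the environment variable set by the override, for example with `os.environ["TASK_TOKEN"]`.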
Putting it together
By applying these patterns, the session's fictional architecture transforms from a chain of synchronous calls into a resilient system. The order service emits an order.created event to EventBridge. Payment processing becomes asynchronous, accepting the transaction optimistically and handling failures by messaging the customer. Customer notifications go through a separate EventBridge rule. The loyalty service processes orders from an SQS queue at its own pace, completely isolated from the order flow. Monthly batch processing of loyalty tiers runs through Step Functions Activities, distributing work across an ECS worker pool.
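On the producer side, emitting order.created can be as small as building one event entry and handing it to the bus. The sketch below assumes a hypothetical bus name `orders-bus`, a source of `com.example.orders`, and an invented detail payload; the entry keys follow the shape EventBridge's PutEvents API expects. `FakeEventsClient` stands in for a real client so the example runs offline.

```python
import json


def build_order_created_entry(
    order_id: str, customer_id: str, total_cents: int, bus_name: str = "orders-bus"
) -> dict:
    """Build a single PutEvents entry. The detail payload is illustrative."""
    return {
        "Source": "com.example.orders",
        "DetailType": "order.created",
        "Detail": json.dumps(
            {"orderId": order_id, "customerId": customer_id, "totalCents": total_cents}
        ),
        "EventBusName": bus_name,
    }


def publish(events_client, entry: dict) -> str:
    """Send one entry and return its event ID, raising if the bus rejects it."""
    response = events_client.put_events(Entries=[entry])
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"event rejected: {response['Entries']}")
    return response["Entries"][0]["EventId"]


class FakeEventsClient:
    """Stand-in for an EventBridge client; stores entries instead of sending."""

    def __init__(self):
        self.sent = []

    def put_events(self, Entries):
        self.sent.extend(Entries)
        return {"FailedEntryCount": 0, "Entries": [{"EventId": "evt-1"}]}


client = FakeEventsClient()
entry = build_order_created_entry("o-123", "c-456", 2599)
event_id = publish(client, entry)
print(event_id, entry["DetailType"])
```

The order service has no reference to payment, notification, or loyalty code here. Which consumers receive the event is decided entirely by the rules attached to the bus.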
The result is a system where a slow loyalty service no longer takes down the checkout flow, and where each component scales based on its own load rather than the peak of the entire system.
If you are running containerized microservices and have not explored EDA, the starting point is small: pick one synchronous call that causes you pain, add a broker, and migrate that one integration. Your understanding of the patterns will grow with each iteration, and you will be in a better position for your next peak traffic event.
Watch the full session recording here: CNS307 - Building event-driven architectures using Amazon ECS with AWS Fargate