re:Invent 2025 - Designing mission critical applications with serverless services
Booking.com spent three years replacing a 20-plus-year-old Perl reservation system that handled millions of daily reservations. Session CNS362 at re:Invent 2025 documents that journey and pairs it with practical patterns for decomposing tightly coupled monoliths and rethinking how distributed services communicate.
Many engineering teams recognize the same pattern: a codebase that once deployed multiple times per week now ships monthly, testing cycles stretch into hours, and rollbacks are routine. In session CNS362 at re:Invent 2025, Luca Mezzalira, Principal Serverless Specialist at Amazon Web Services, Sara Gerion, Principal Solutions Architect at Amazon Web Services, and Zeeshan Pervaiz, Engineering Manager at Booking.com, addressed this directly through a combination of architectural principles and a detailed case study. In this post, we'll walk through Booking.com's modernization of their accommodation reservation backend (ARB), the decomposition patterns that make a migration tractable, and how serverless changes the way services communicate.
Luca opened the session by framing what distributed systems are designed to achieve: organizational scalability, business agility, faster feedback loops measured through DORA (DevOps Research and Assessment) metrics, and limiting the impact when individual services fail. He made the case that expressing modularity at the infrastructure level through serverless services is more durable than enforcing it through code conventions alone. Code-level discipline erodes over time as teams change. Infrastructure-level constraints are built into the service itself.
How Booking.com modernized a mission-critical reservation system
Booking.com's ARB started as a small, cohesive Perl codebase. By the time Zeeshan's team took on modernization, it carried 90-plus dependencies, more than 80% of which were hard dependencies with cascading failure behavior. Domain boundaries had dissolved: the reservation system was calculating commissions, writing charges to databases owned by other teams, and absorbing logic from pricing, fraud detection, and customer service workflows. The system ran on bare metal servers, which required regular security patch cycles, manual capacity checks, and hardware maintenance.
The human cost was measurable. Onboarding a new developer took four months and still produced only partial domain coverage. Unclear ownership slowed debugging during outages, and keeping the lights on (KTLO) consumed capacity that could have gone to new features. The team committed to modernizing using serverless, starting with a narrow proof-of-concept: a Lambda function making a test call to an on-premises service to verify connectivity and latency. When it worked, confidence in the approach grew.
The architecture that emerged uses AWS Step Functions to orchestrate 15 AWS Lambda functions, with Amazon DynamoDB storing request data temporarily using a time-to-live (TTL) to keep the table size stable. At the output stage, Amazon SNS fans out to Amazon SQS queues, decoupling downstream processing from the main workflow. Step Functions provides catch blocks for error handling out of the box, removing a category of boilerplate the team no longer has to write or maintain.
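The orchestration described above can be sketched as a state machine definition in Amazon States Language, built here as a plain Python dictionary. This is a minimal illustration, not Booking.com's actual definition: the step names, the ARNs, the single shared failure handler, and the `expires_at` attribute name are all assumptions. The `Catch` field and the epoch-seconds TTL attribute are the two mechanisms the paragraph refers to.

```python
import json
import time


def build_state_machine(function_arns, failure_handler_arn):
    """Chain Lambda tasks in order; each task routes any error to one handler."""
    states = {}
    for index, arn in enumerate(function_arns):
        state = {
            "Type": "Task",
            "Resource": arn,
            # Catch block: on any error, keep the input, attach the error
            # details under $.error, and jump to the failure handler.
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",
                    "Next": "HandleFailure",
                }
            ],
        }
        if index == len(function_arns) - 1:
            state["End"] = True
        else:
            state["Next"] = f"Step{index + 2}"
        states[f"Step{index + 1}"] = state
    states["HandleFailure"] = {
        "Type": "Task",
        "Resource": failure_handler_arn,
        "End": True,
    }
    return {"StartAt": "Step1", "States": states}


def request_item(request_id, payload, ttl_seconds=3600, now=None):
    """Build a temporary request record with a DynamoDB TTL attribute.

    DynamoDB TTL expects a Number attribute holding epoch seconds;
    'expires_at' is a hypothetical attribute name.
    """
    now = int(time.time()) if now is None else now
    return {
        "request_id": request_id,
        "payload": payload,
        "expires_at": now + ttl_seconds,
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    arns = [f"arn:aws:lambda:REGION:ACCOUNT:function:step-{i}" for i in range(1, 4)]
    definition = build_state_machine(
        arns, "arn:aws:lambda:REGION:ACCOUNT:function:on-failure"
    )
    print(json.dumps(definition, indent=2))
```

Declaring error routing in the definition keeps each Lambda function focused on its own step, which is the boilerplate reduction the team described.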
Lambda cold starts in a Java runtime were the most significant technical challenge. The team tested three configurations: provisioned concurrency at 0.5 GB of memory, provisioned concurrency at 1 GB of memory, and Lambda SnapStart. Provisioned concurrency at 1 GB proved the best starting point, but latency was still 3,200 milliseconds. Through SDK optimizations and Java initialization improvements that pre-warmed objects at startup, they brought that down to 140 milliseconds. For go-live validation, the team first ran shadow traffic, duplicating production requests to the new system and comparing its outputs against the legacy system, ramping from 1% to 100% over a month. They then scaled live traffic in stages: 1, 10, 30, 45, 90, and 100 percent.
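The session did not describe how traffic was split, so the following is only one plausible sketch of the two validation steps. It assumes a stable hash of a request identifier decides which requests fall inside the current percentage, and that shadow mode always returns the legacy result while recording whether the new system agreed. The function names and the hashing scheme are illustrative assumptions.

```python
import hashlib

# Live-traffic stages reported in the session, in percent.
LIVE_STAGES = [1, 10, 30, 45, 90, 100]


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(request_id: str, percent: int) -> bool:
    """True if this request falls inside the current rollout percentage.

    Because the bucket is stable, a request included at 10% stays
    included at 30%, 45%, and so on.
    """
    return bucket(request_id) < percent


def shadow_compare(request, legacy_handler, new_handler):
    """Serve from legacy, run the new system on a copy, report agreement.

    A failure in the new system never affects the response; it is
    recorded as a mismatch instead.
    """
    legacy_result = legacy_handler(request)
    try:
        matched = new_handler(request) == legacy_result
    except Exception:
        matched = False
    return legacy_result, matched


if __name__ == "__main__":
    ids = [f"reservation-{i}" for i in range(10_000)]
    for percent in LIVE_STAGES:
        share = sum(in_rollout(i, percent) for i in ids) / len(ids)
        print(f"stage {percent:>3}%: observed share {share:.1%}")
```

Hashing rather than random sampling keeps a given reservation on the same side of the split as the percentage grows, which makes mismatches easier to reproduce.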
The outcomes were concrete. New developers now produce their first merge request (MR) within their first month, compared to four months previously. Amazon CloudWatch metrics, AWS X-Ray tracing, and Step Functions execution views cut root cause analysis to hours. The reservation creation pathway now runs at 100% on the new system, and the legacy endpoints have been decommissioned.
Patterns for decomposing the monolith
Sara introduced the broader context for modernization: it is not purely a technology problem. It involves people, processes, and technology together, and a weakness in one area creates ripple effects across the others. With that framing, she walked through the decomposition patterns that most commonly apply when moving from a monolith to serverless services.
The Strangler Fig pattern handles incremental replacement. A routing layer directs traffic between the legacy system and the newly modernized component, allowing both to coexist until the legacy component can be retired. Branch by abstraction is useful for shared capabilities, hiding the complexity of running two implementations simultaneously behind an abstraction so that dependent services do not need to change during the migration.
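As a rough sketch of the routing layer in the Strangler Fig pattern, the class below sends a request to a modernized handler when its path matches a migrated prefix and to the legacy handler otherwise. The class name, the prefix-based matching, and the example paths are assumptions made for illustration; in practice this role is often played by an API gateway or load balancer rather than application code.

```python
class StranglerRouter:
    """Route each request to a migrated handler or fall back to legacy."""

    def __init__(self, legacy_handler):
        self._legacy = legacy_handler
        self._migrated = {}  # path prefix -> modernized handler

    def migrate(self, path_prefix, handler):
        """Mark a path prefix as served by the modernized component."""
        self._migrated[path_prefix] = handler

    def handle(self, path, payload):
        # Prefer the most specific (longest) migrated prefix that matches.
        matches = [prefix for prefix in self._migrated if path.startswith(prefix)]
        if matches:
            return self._migrated[max(matches, key=len)](payload)
        return self._legacy(payload)


if __name__ == "__main__":
    router = StranglerRouter(lambda payload: ("legacy", payload))
    # Hypothetical path: only reservation creation has been migrated so far.
    router.migrate("/reservations/create", lambda payload: ("serverless", payload))

    print(router.handle("/reservations/create", {"id": 1}))
    print(router.handle("/reservations/cancel", {"id": 1}))
```

Once every prefix has a migrated handler and the legacy fallback receives no traffic, the legacy component can be retired.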
Decompose by transaction maps particularly well to serverless adoption. When a customer creates an order, they expect an immediate response. When a customer cancels an order, they tolerate a short delay while the system processes downstream actions. These different user expectations call for different architectures. Order creation fits Amazon API Gateway, Lambda, and DynamoDB returning a fast synchronous response. Order cancellation fits API Gateway and Step Functions managing a longer orchestration workflow. Decomposing by transaction gives each flow an architecture suited to its actual latency requirement rather than forcing both through a single design.
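To make the contrast concrete, here is a minimal sketch of the two flows as handler functions. It assumes the storage write and the workflow start are injected as callables (`put_item` and `start_workflow` are hypothetical stand-ins for a DynamoDB write and a Step Functions execution start), and the status values and response shapes are illustrative only.

```python
import uuid


def create_order(order, put_item):
    """Synchronous flow: persist the order and answer immediately."""
    order_id = str(uuid.uuid4())
    put_item({**order, "order_id": order_id, "status": "CREATED"})
    return {
        "statusCode": 201,
        "body": {"order_id": order_id, "status": "CREATED"},
    }


def cancel_order(order_id, start_workflow):
    """Asynchronous flow: start the orchestration and acknowledge.

    The downstream cancellation steps run inside the workflow; the
    caller only learns that the cancellation was accepted.
    """
    execution_id = start_workflow({"order_id": order_id, "action": "cancel"})
    return {
        "statusCode": 202,
        "body": {
            "order_id": order_id,
            "status": "CANCELLATION_PENDING",
            "execution_id": execution_id,
        },
    }


if __name__ == "__main__":
    table = []  # in-memory stand-in for the orders table
    created = create_order({"item": "room-42"}, table.append)
    order_id = created["body"]["order_id"]
    cancelled = cancel_order(order_id, lambda _input: "execution-1")
    print(created["statusCode"], cancelled["statusCode"])
```

The different status codes reflect the different contracts: the first response confirms a completed write, the second confirms only that work has been accepted.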
Rethinking how services communicate
Sara closed the session with a gift code example that illustrated how serverless changes the assumptions architects make about synchronous versus asynchronous communication. A gift code validation flow that calls an external CRM system synchronously works under normal conditions but creates a dependency on that CRM's response time during peak periods, precisely when latency has the most impact on customer experience.
The serverless approach decouples the immediate customer response from the slower validation process. When a customer enters a gift code, it is saved immediately to DynamoDB. Amazon EventBridge Pipes picks up the DynamoDB stream event and handles propagation to external validation systems: loyalty program, fraud detection, and the CRM. EventBridge Pipes also transforms the event payload to match the format each external system expects. If validation fails, the existing notification service informs the customer. The customer receives an instant acknowledgment and continues shopping without waiting for the external systems to complete.
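The payload transformation step can be illustrated in plain Python. The sketch below takes a record in the DynamoDB Streams format, converts the typed `NewImage` attributes into ordinary values, and reshapes them for each downstream system. The attribute names (`gift_code`, `customer_id`, `amount`) and the three target payload shapes are hypothetical, and the session did not specify how the pipes and targets were wired, so this shows the mapping logic only, not EventBridge Pipes configuration.

```python
from decimal import Decimal


def unmarshal(image):
    """Convert DynamoDB-typed attributes ({"S": ...}, {"N": ...}) to plain values."""
    plain = {}
    for name, typed_value in image.items():
        type_tag, raw = next(iter(typed_value.items()))
        if type_tag == "S":
            plain[name] = raw
        elif type_tag == "N":
            plain[name] = Decimal(raw)  # DynamoDB sends numbers as strings
        elif type_tag == "BOOL":
            plain[name] = raw
        else:
            raise ValueError(f"unsupported attribute type: {type_tag}")
    return plain


def to_target_payloads(record):
    """Shape one stream record into per-system payloads (hypothetical formats)."""
    if record.get("eventName") != "INSERT":
        return {}  # only newly saved gift codes need validation
    item = unmarshal(record["dynamodb"]["NewImage"])
    return {
        "crm": {"code": item["gift_code"], "customerId": item["customer_id"]},
        "fraud": {
            "subject": item["customer_id"],
            "amount": str(item["amount"]),
        },
        "loyalty": {"member": item["customer_id"], "giftCode": item["gift_code"]},
    }


if __name__ == "__main__":
    record = {
        "eventName": "INSERT",
        "dynamodb": {
            "NewImage": {
                "gift_code": {"S": "SPRING25"},
                "customer_id": {"S": "c-1"},
                "amount": {"N": "25.00"},
            }
        },
    }
    for system, payload in to_target_payloads(record).items():
        print(system, payload)
```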
The important point Sara made is that this is not an argument for making everything asynchronous. The initial customer response is synchronous, handled by API Gateway, Lambda, and DynamoDB. The external validation runs asynchronously through EventBridge Pipes. Serverless makes it practical to mix these patterns at the right points in a workflow, so architects can match the communication model to the actual experience requirement rather than applying one approach throughout. As Sara concluded, distributed systems are not static. They evolve because the organizations running them evolve, and the decision to adopt serverless is as much a strategic choice about how teams deliver value as it is a technical one.
Watch the full session: re:Invent 2025 CNS362 - Designing mission critical applications with serverless services
