Sustaining peak performance: How Stripe powered a record-breaking Black Friday with AWS Enterprise Support

12 minute read
Content level: Advanced

This case study examines how AWS Enterprise Support helped Stripe achieve record-breaking results during the Black Friday and Cyber Monday (BFCM) period in 2024. It describes the strategic partnership between Stripe and AWS Enterprise Support, and offers insights, best practices, and practical implementation tips that organizations of any size can apply to their own high-stakes events.

Introduction

In the fast-paced world of ecommerce, the BFCM period can push online payment systems to their limits. For the 2024 BFCM period, Stripe, a leading programmable financial services platform, prepared for record-breaking transaction volumes. In previous BFCM periods, Stripe had processed millions of transactions that amounted to billions of dollars while maintaining high availability for its payment API. Faced with a significant projected increase in transaction volume over 2023, Stripe needed to scale its infrastructure while preserving speed, reliability, and security. With these goals in mind, Stripe engaged AWS Enterprise Support in a strategic partnership and started preparations for the BFCM period months in advance.

At the core of this partnership is the Technical Account Manager (TAM). The TAM acts as a dedicated advisor who guides customers through their AWS journey and provides resources to meet specific business needs. The Enterprise Support tier provides customers with general guidance, best practices, troubleshooting, and operational support throughout their cloud journey.

Abhisek Chatterjee, Stripe Head of core Infra, said, "During the four days of BFCM 2024, Stripe processed over 465 million transactions totaling more than $31 billion in payment volume. Our APIs maintained greater than 99.9999% uptime throughout the entire period. This level of reliability was made possible in large part due to the exceptional support we received from AWS. Their dedicated team partnered closely with us through every phase—from pre-event preparation and capacity planning to DDoS mitigation and live incident monitoring. We deeply appreciate AWS's responsiveness, technical depth, and proactive engagement".

Solution overview

AWS Countdown, a feature that's included in the AWS Enterprise Support plan, helps you prepare for critical events. You can start to use this feature several months before the event. As the event date approaches, the engagement becomes more hands-on. For this case study, Stripe used AWS Countdown to prepare for the BFCM period months in advance.

Event preparation

Initial planning that happens several months before the event

Your AWS journey begins with comprehensive planning sessions. In these sessions, your TAM helps define clear success metrics and forecast resource requirements based on historical data and growth projections. With tools such as AWS Trusted Advisor, you can conduct thorough architecture reviews to identify potential bottlenecks and areas of concern. In this phase, TAMs help you create detailed runbooks and scaling plans that align with your business objectives.
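
The forecasting step above can be illustrated with a small calculation. The following Python sketch is not Stripe's or AWS's method. It assumes a hypothetical historical peak, growth factor, safety headroom, and per-instance throughput, and estimates how many instances a projected peak might need.

```python
import math


def forecast_instances(
    last_peak_tps: float,
    growth_factor: float,
    headroom: float,
    tps_per_instance: float,
) -> int:
    """Estimate the instance count needed for a projected peak.

    All inputs are hypothetical planning values, not figures from this article.
    """
    if tps_per_instance <= 0:
        raise ValueError("tps_per_instance must be positive")
    projected_peak = last_peak_tps * growth_factor     # expected peak load
    target_capacity = projected_peak * (1 + headroom)  # add a safety margin
    return math.ceil(target_capacity / tps_per_instance)


# Example: 10,000 TPS last year, 40% growth, 30% headroom, 250 TPS per instance.
print(forecast_instances(10_000, 1.4, 0.30, 250))  # prints 73
```

An estimate like this is only a starting point for the planning conversation. Load test results should replace the assumed per-instance throughput before any capacity is requested.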

Infrastructure optimization that happens weeks before the event

As the event approaches, the focus shifts to infrastructure optimization. You can evaluate architecture against the AWS Well-Architected Framework to make sure that it aligns with AWS best practices. Based on these assessments, you can implement the necessary upgrades and optimizations to enhance the resilience of your system. Load testing is a critical part of this phase. Use solutions, such as Distributed Load Testing on AWS, to simulate peak conditions and validate system performance.

Final preparations that happen a few days before the event

In the days leading up to the event, TAMs conduct daily check-ins to address last-minute concerns and validate readiness. You can fine-tune configurations based on final load test results and predictions. You can also work with TAMs to establish clear communication channels and incident response procedures. This makes sure that all teams understand their roles and escalation paths.

Event monitoring during the event

Use Amazon Managed Service for Prometheus and custom dashboards for comprehensive real-time monitoring of the event. Track metrics such as transaction throughput, API response times, error rates, auto-scaling activity, and database performance. Custom dashboards display visual representations of these metrics so that you can quickly identify trends or anomalies. You can run predefined response plans with support from dedicated AWS teams. To maintain clear communication with stakeholders, you can also send regular status updates through established protocols.
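
As a rough illustration of how metrics such as error rates and response times can be turned into anomaly flags, the following Python sketch checks per-minute samples against thresholds. The sample data, field names, and thresholds are all assumptions made for this example. In practice, alerting rules in the monitoring stack would typically do this work.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape interval of hypothetical API metrics."""
    minute: int
    requests: int
    errors: int
    p99_latency_ms: float


def find_breaches(samples, max_error_rate=0.001, max_p99_ms=500.0):
    """Return (minute, reason) pairs for samples that exceed a threshold."""
    breaches = []
    for s in samples:
        error_rate = s.errors / s.requests if s.requests else 0.0
        if error_rate > max_error_rate:
            breaches.append((s.minute, f"error rate {error_rate:.4%}"))
        if s.p99_latency_ms > max_p99_ms:
            breaches.append((s.minute, f"p99 latency {s.p99_latency_ms:.0f} ms"))
    return breaches


samples = [
    Sample(minute=0, requests=120_000, errors=30, p99_latency_ms=210.0),
    Sample(minute=1, requests=180_000, errors=540, p99_latency_ms=230.0),
    Sample(minute=2, requests=175_000, errors=40, p99_latency_ms=640.0),
]
for minute, reason in find_breaches(samples):
    print(f"minute {minute}: {reason}")
```

With this sample data, minute 1 is flagged for its error rate and minute 2 for its p99 latency.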

Post-event analysis

After the event, you can conduct root cause analyses with Amazon Managed Service for Prometheus and Amazon Managed Grafana. You can also use your preferred monitoring tools to collect detailed metrics. This data helps document lessons learned and identify areas for optimization. Example insights include opportunities to optimize database queries by adding missing indexes, improve API endpoint performance during peak hours, and enhance cache strategies to maintain higher hit rates. These insights create the foundation to develop action items, prepare for future events, and create a continuous improvement cycle.

TAM coordination and cross-team collaboration

Throughout this process, the TAM serves as your primary point of contact. The TAM coordinates efforts across various AWS teams and your organization to create a smooth and successful event. In Stripe's case, their TAM team collaborated with various internal teams, such as AWS Capacity, DDoS Response, AWS Backbone Engineering, and Amazon Managed Grafana. This collaboration made sure that Stripe's infrastructure was fully prepared for the unprecedented BFCM transaction volumes. In the subsequent sections, the article covers how these preparations helped Stripe with their best BFCM week ever.

Note: AWS also offers an AWS Countdown Premium tier. This tier provides critical support with designated engineers across all phases of your cloud projects, from design to post-launch retrospectives.

Capacity management and infrastructure optimization

Capacity management and infrastructure optimization are required to make sure that an organization's infrastructure can meet current and future business needs efficiently and cost-effectively. To create a smooth experience, Stripe and the AWS Support team completed the following preparations:

Extensive load testing

To simulate peak BFCM period conditions, Stripe conducted extensive load tests. Through these tests, the team could identify and address potential bottlenecks before they affected production traffic.
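
One common way to judge a load test run is to compare a latency percentile against a budget. The following Python sketch shows that check with a nearest-rank percentile. The latency values and the 300 ms budget are hypothetical, and the article does not describe how Stripe evaluated its own test results.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes_slo(latencies_ms, p99_budget_ms):
    """Return True when the p99 latency is within the budget."""
    return percentile(latencies_ms, 99) <= p99_budget_ms


# Hypothetical latencies from one load test run: 98 fast requests and 2 slow ones.
latencies = [40.0] * 98 + [900.0, 950.0]
print(percentile(latencies, 50), percentile(latencies, 99))  # prints 40.0 900.0
print(passes_slo(latencies, p99_budget_ms=300.0))            # prints False
```

The median looks healthy here while the p99 fails the budget, which is why tail percentiles are usually more informative than averages when looking for bottlenecks.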

Proactive capacity planning

To maintain consistent Amazon Elastic Compute Cloud (Amazon EC2) capacity for both routine and peak operations, Stripe used On-Demand Capacity Reservations (ODCRs). To prepare for the BFCM period, Stripe also undertook significant projects, including migrating serverless applications to different accounts and setting up multiple active AWS Regions to serve traffic during a disaster. For the peak event, the TAM team coordinated the delivery of thousands of Amazon EC2 instances. To improve resource allocation and monitor waste, Stripe requested AWS-assisted ODCRs with custom tagging. With these ODCRs, Stripe could provide capacity more efficiently to the teams that needed it.
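
To make the idea of a tagged capacity reservation concrete, the following Python sketch builds the arguments for the Amazon EC2 CreateCapacityReservation API. It is an illustration only: the instance type, Availability Zone, instance count, end date, and the `team` and `event` tag keys are assumptions, not details of Stripe's setup.

```python
from datetime import datetime, timezone


def build_odcr_request(instance_type, availability_zone, count, end_date, team, event):
    """Build keyword arguments for the EC2 CreateCapacityReservation API.

    The values passed in are illustrative assumptions, not figures from this article.
    """
    return {
        "InstanceType": instance_type,
        "InstancePlatform": "Linux/UNIX",
        "AvailabilityZone": availability_zone,
        "InstanceCount": count,
        "EndDateType": "limited",  # release the reservation after the event
        "EndDate": end_date,
        "InstanceMatchCriteria": "open",
        "TagSpecifications": [
            {
                "ResourceType": "capacity-reservation",
                # Hypothetical tags that attribute capacity to a team and an event.
                "Tags": [
                    {"Key": "team", "Value": team},
                    {"Key": "event", "Value": event},
                ],
            }
        ],
    }


params = build_odcr_request(
    instance_type="m5.4xlarge",
    availability_zone="us-west-2a",
    count=200,
    end_date=datetime(2024, 12, 4, tzinfo=timezone.utc),
    team="payments-api",
    event="bfcm-2024",
)
print(params["InstanceCount"], params["TagSpecifications"][0]["Tags"])
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("ec2").create_capacity_reservation(**params)`. Tags like these make it possible to group reservations by owner when reviewing unused capacity.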

Robust contingency planning

To make sure that business continued even in worst-case scenarios, Stripe created contingency plans. These plans included failover systems, such as multi-Region database replication. For example, Stripe configured Amazon Route 53 DNS failover to automatically redirect traffic to backup Regions if a primary Region had issues, which maintained continuity of payment processing. These failover systems and contingency measures prepared Stripe to maintain service levels through any major disruption that might occur during the BFCM period. For more information on these AWS best practices, see AWS Well-Architected Framework.
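
The DNS failover described above can be sketched as a pair of Route 53 failover records. The following Python example builds a change batch for the Route 53 ChangeResourceRecordSets API. The domain names, set identifiers, and health check ID are placeholders chosen for this example, not values from Stripe's configuration.

```python
def failover_record(name, set_id, role, target, health_check_id=None, ttl=60):
    """Build one Route 53 failover record set (role is PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        # Route 53 fails over when this health check reports unhealthy.
        record["HealthCheckId"] = health_check_id
    return record


def build_change_batch(records):
    """Wrap record sets in an UPSERT change batch."""
    return {
        "Comment": "Failover routing sketch",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records],
    }


batch = build_change_batch([
    failover_record("api.example.com.", "primary-us-west-2", "PRIMARY",
                    "api-usw2.example.com", health_check_id="hc-primary-placeholder"),
    failover_record("api.example.com.", "secondary-us-east-1", "SECONDARY",
                    "api-use1.example.com"),
])
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

With boto3, the batch could be submitted through `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, using a real hosted zone ID and health check ID. A short TTL is assumed here so that resolvers pick up a failover quickly.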

DDoS readiness and security enhancement

During high-profile events such as BFCM, online services can become targets for bad actors. In particular, Distributed Denial of Service (DDoS) attacks pose a significant threat to ecommerce platforms and payment processors. These attacks can overwhelm servers, disrupt services, and potentially lead to substantial financial losses and damage to a company's reputation.

Stripe and the Enterprise Support team used AWS Shield Advanced to strengthen Stripe's DDoS protection capabilities.

The implementation included:

  • Integration with AWS Shield Advanced for enhanced DDoS protection
  • Optimization of security configurations to establish rapid response to potential threats
  • Establishment of streamlined incident response protocols
  • Collaboration between the TAM team and AWS Shield Response Team (SRT) to review and optimize security configurations

These measures helped Stripe protect its infrastructure while maintaining consistent service availability during peak shopping periods.

Service readiness and performance optimization

TAMs assessed and confirmed Stripe's infrastructure readiness for the BFCM period with comprehensive service reviews and optimization activities. The TAM team identified the following critical services that experienced peak loads during the BFCM period:

  • Amazon Aurora databases: Critical for handling the massive influx of transaction data, Aurora's performance directly affects payment processing speed and reliability. Database optimization is required for rapid, consistent data access and storage during peak transaction periods.

  • Amazon Managed Service for Prometheus and Amazon Managed Grafana: These services are critical for Stripe's observability stack. Real-time monitoring during the BFCM period helps quickly identify and resolve issues. Stripe's team used these services to visualize system performance, detect anomalies, and respond to potential problems before they could affect customers.

  • Amazon EC2: EC2 instances run the core payment processing applications for Stripe. Stripe needed sufficient capacity and an optimal configuration for these instances to maintain processing speed and reliability under extreme loads.

  • AWS networking and content delivery: With millions of transactions flowing through the system, network performance can become a bottleneck. Stripe needed to optimize their networking to maintain low-latency communication between services and reliable connectivity for merchants and customers.

The collaboration between TAMs, AWS service teams, and Stripe engineering teams strengthened service reliability. With this enhanced reliability, Stripe could successfully handle the unprecedented transaction volumes during the 2024 BFCM period.

Service quotas review and proactive management

A service quotas review was an important pre-event assessment for this case study. AWS Enterprise Support conducted a granular analysis of Stripe's existing AWS service quotas across critical domains, such as compute, storage, network resources, and the observability platform. The team systematically mapped out potential constraint points and projected peak loads to develop a strategic approach. That way, they could preemptively request quota increases and optimize resources.

This review covered API call rate quotas as well as the quotas for individual services.

In addition to core infrastructure services, Stripe engineers treated the observability platform as a critical dependency. During high-stakes transaction periods, the observability platform became a necessary troubleshooting tool. The platform allowed engineers to rapidly diagnose and mitigate issues with real-time visibility into system performance, transaction flows, and potential bottlenecks across hundreds of microservices. The TAM worked with service teams to make sure that Stripe's observability platform could ingest the required number of metrics. The TAM also made sure that Stripe engineers could seamlessly access and interact with dashboards during critical troubleshooting scenarios.

This comprehensive service quotas review helped Stripe flexibly scale without any restrictive service quotas during the high-intensity BFCM period. Proactive management prevented potential performance bottlenecks and provided a robust and scalable cloud infrastructure that could handle unprecedented transaction volumes.
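
A quota review of this kind comes down to comparing each quota with projected peak usage plus some headroom. The following Python sketch shows one way to do that comparison. The quota labels, current limits, projected peaks, and the 20 percent headroom are hypothetical, and the labels are not real Service Quotas codes.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division that avoids floating-point rounding."""
    return -(-numerator // denominator)


def quotas_needing_increase(quotas, projected_peak, headroom_pct=20):
    """Return {quota_name: suggested_limit} for quotas that would leave
    less than headroom_pct percent unused at projected peak usage."""
    suggestions = {}
    for name, limit in quotas.items():
        peak = projected_peak.get(name, 0)
        # Smallest limit that still keeps headroom_pct percent free at peak.
        required = ceil_div(peak * 100, 100 - headroom_pct)
        if limit < required:
            suggestions[name] = required
    return suggestions


# Hypothetical current limits and projected peak usage.
current = {"ec2_vcpus": 10_000, "aurora_instances": 40, "active_metric_series": 5_000_000}
peak = {"ec2_vcpus": 9_000, "aurora_instances": 24, "active_metric_series": 4_800_000}

print(quotas_needing_increase(current, peak))
# prints {'ec2_vcpus': 11250, 'active_metric_series': 6000000}
```

In a real review, current values could come from the Service Quotas API, and increases could be requested there or through AWS Support well before the event, because some increases take time to approve.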

Incident management and rapid response

The implementation of AWS Incident Detection and Response for critical workloads was an important element of the preparation. This comprehensive service is designed to help organizations quickly identify, analyze, and respond to operational events within their AWS environments. AWS Incident Detection and Response facilitates collaboration with AWS to develop runbooks and response plans for your workloads. In this case, the AWS Enterprise Support team worked with Stripe to identify critical workloads and onboard them into this service. AWS TAMs also worked with the relevant service teams to create customized runbooks that Stripe uses during critical events.

As the BFCM period approached, the TAM team maintained regular check-ins with Stripe to make sure that the preparations occurred as planned. These sessions included synchronized updates with AWS service team leadership and Stripe to make sure that all parties were aligned on preparations and expectations. AWS established dedicated communication channels for immediate escalation, including direct access to specialized AWS Support teams. This was required to make sure that Stripe received immediate assistance for any critical issues that occurred during the BFCM period.

Conclusion

Stripe was successful with the 2024 BFCM period and broke their previous records. During this period, Stripe processed 465 million transactions with a total payment volume of more than $31 billion. This period was the largest ever 4-day period on Stripe. Also, Stripe's API maintained an uptime of more than 99.9999%.
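
To put these figures in perspective, the following Python sketch converts them into averages and a downtime budget. It uses only the numbers quoted above and treats them as exact, although the source states them as lower bounds. The results are averages over the four days, so peak rates were higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

transactions = 465_000_000           # 465 million transactions
payment_volume_usd = 31_000_000_000  # more than $31 billion
days = 4
unavailable_ppm = 1                  # 99.9999% uptime leaves 1 part per million

period_seconds = days * SECONDS_PER_DAY
avg_tps = transactions / period_seconds
avg_payment_usd = payment_volume_usd / transactions
max_downtime_s = period_seconds * unavailable_ppm / 1_000_000

print(f"{period_seconds} s, {avg_tps:.0f} TPS average, "
      f"${avg_payment_usd:.2f} average payment, "
      f"{max_downtime_s:.4f} s downtime budget")
# prints: 345600 s, 1345 TPS average, $66.67 average payment, 0.3456 s downtime budget
```

In other words, an uptime of more than 99.9999% over four days corresponds to less than about 0.35 seconds of API unavailability in total.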

For organizations that prepare for high-stakes events, AWS Enterprise Support provides the expertise, tools, and support to achieve success at scale. To learn more about our plans and offerings, see AWS Support.


About the authors

Jyothsna Yarlagadda

Jyothsna Yarlagadda is a Senior TAM with AWS Enterprise Support. She works closely with Financial Services customers to help them optimize their cloud infrastructure securely. In her spare time, Jyothsna likes to spend quality time with her children and pursue her passion for baking.

Mateus Prado

Mateus Prado is a TAM at AWS. He lives in the Dallas-Fort Worth metropolitan area and is passionate about software engineering for critical-scale environments. He enjoys working with customers to determine the root cause of complex issues. In his spare time, Mateus enjoys smoking meats, producing charcuterie, and playing the drums.

Mohan Musti

Mohan Musti is a Dallas-based Principal TAM at AWS. Mohan helps customers architect and optimize applications on AWS. In his spare time, Mohan enjoys spending time with his family and camping.

Prashob Krishnan

Prashob Krishnan is a Denver-based TAM at AWS. He is passionate about security and enjoys working with customers to solve their technical challenges and build secure scalable architecture in the AWS Cloud.