AWS re:Invent 2024 - Building production-grade resilient architectures with Amazon EKS


This blog post summarizes key highlights from the AWS re:Invent 2024 session "Building production-grade resilient architectures with Amazon EKS" presented by Carlos Santana and Niall Thomson from AWS. The session explored practical approaches for managing Amazon EKS clusters at scale, covering GitOps-based management, effective observability, and governance strategies that platform teams can implement immediately.

What happens when your organization needs to manage tens or even hundreds of Amazon Elastic Kubernetes Service (Amazon EKS) clusters across different teams and environments? At AWS re:Invent 2024, Carlos Santana, Senior Specialist Solutions Architect, and Niall Thomson, Container Specialist Solutions Architect at AWS, tackled this challenge head-on in their session about building resilient Amazon EKS architectures.

The Growing Complexity of Kubernetes Management

As organizations adopt Kubernetes, the number of clusters they manage tends to grow rapidly. Carlos highlighted an impressive statistic: as of July 2024, there has been a 33% year-over-year increase in Amazon EKS clusters managed by AWS. This growth brings significant challenges for platform teams who need to maintain consistency while supporting diverse teams and applications.

The scenario is familiar to many organizations. You start with one Amazon EKS cluster and a pipeline to deploy it. Then another team requests a cluster, so you copy and paste your pipeline, making slight modifications. Before long, you're managing dozens of slightly different pipelines, each with its own version and configuration – what Carlos called "snowflakes".

"When you have unmanaged growth of clusters, enforcing standards becomes hard to automate," Carlos explained. "Out-of-date management becomes something that gets out of control with so many open source projects. The team cannot be an expert on every add-on out there that you put in your clusters."

Approaches to Serving Teams with Amazon EKS

Organizations typically adopt one of several models for providing Kubernetes capabilities to their development teams:

Templates as a Service offers a low barrier to entry – the platform team creates templates that development teams use to provision their own clusters. However, Carlos pointed out a significant drawback: "Development teams, guess what? They don't upgrade their clusters". This approach becomes difficult to manage at scale.

Cluster as a Service involves platform teams taking more ownership, managing the clusters themselves while providing access to development teams. This approach keeps clusters updated and properly configured.

For more mature organizations, Namespace as a Service provides development teams with isolated namespaces with predefined resource quotas. "The platform team would have a discussion or maybe a form that asks the team, like, how many resources do you need in that namespace?" Carlos explained.
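
In practice, that conversation usually lands in a Kubernetes ResourceQuota applied to the team's namespace. A minimal sketch, assuming a hypothetical team-a namespace with illustrative limits:

```yaml
# Hypothetical quota for Namespace as a Service; names and limits are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```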

Some organizations take this even further with an application-centric approach where developers simply commit their code, and the platform handles everything from containerization to deployment – completely abstracting away the Kubernetes layer.

GitOps: Bringing Order to Cluster Management

Both speakers emphasized GitOps as an effective methodology for managing multiple Amazon EKS clusters. With GitOps, you define your desired state in Git repositories and automated processes maintain your clusters to match that state.

"By using the iterative configuration, it reduces complexity", Carlos noted. "When using Git, you can track changes and enhance level of visibility. You can even roll back".

Carlos walked through how GitOps can be applied to Amazon EKS, focusing on three main components:

The Control Plane includes configurations like encryption settings, cluster access management, and Kubernetes version upgrades – all managed through AWS APIs.

For the Data Plane, tools like Karpenter help manage worker nodes by provisioning Amazon EC2 instances based on workload requirements through Kubernetes-native APIs.

Add-ons comprise system-wide software deployed on your clusters, including both AWS-provided Amazon EKS add-ons and open-source components.
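
To make the data plane piece concrete, a Karpenter-managed node pool is itself just a manifest that GitOps can deliver. A minimal sketch, assuming the Karpenter v1 API; the requirements and limits are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"          # cap total provisioned CPU across the pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```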

Carlos demonstrated a reference architecture using AWS Controllers for Kubernetes (ACK) – an open source project fully supported by AWS – to provision and manage Amazon EKS clusters through Kubernetes-native APIs, with Argo CD handling the deployment of add-ons and workloads.
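
With ACK, the control plane itself becomes a Kubernetes resource that Argo CD can sync from Git. A minimal sketch, assuming the ACK EKS controller's v1alpha1 schema – the role ARN and subnet IDs are placeholders, and the field names should be verified against the controller documentation:

```yaml
apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: dev-cluster-1
spec:
  name: dev-cluster-1
  version: "1.31"
  roleARN: arn:aws:iam::111122223333:role/eks-cluster-role   # placeholder role
  resourcesVPCConfig:
    subnetIDs:
      - subnet-0abc1234                                      # placeholder subnets
      - subnet-0def5678
```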

Implementing Resilient Upgrade Strategies

One of the most valuable insights from the presentation was how to implement resilient upgrade strategies across multiple clusters. Carlos explained that the Amazon EKS service team itself faces this challenge at massive scale when rolling out patches across all customer clusters worldwide.

"Without velocity, they fall behind, cannot upgrade everything serially", Carlos explained. "Some organizations have hundreds of Amazon EKS clusters. Upgrading in batches, however, needs safeguards in place for resiliency and availability".

Taking inspiration from the Amazon EKS service team's approach, Carlos suggested organizing clusters into "cells" and "waves":

"A cell could be one unit of work, it could be one cluster for example," Carlos explained. "How many cells you can do in one wave? Well, it matters managing the velocity and resiliency. As you progress, the confidence tends to increase."

[Image: Amazon EKS cluster rollout in cells and waves]

As the image above shows, the recommended approach follows a pattern:

  1. Start with a sandbox cluster where the platform team tests their change
  2. Progress to development clusters
  3. Move to staging clusters with more comprehensive testing
  4. Finally, roll out to production clusters in phases

For each wave, the process includes pre-checks (such as using Amazon EKS Upgrade Insights), update procedures, post-update tests, and soak time for monitoring. As confidence increases with each successful wave, the soak time decreases and the number of clusters per wave increases.
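
One way to operationalize this is to encode the waves as configuration that drives the rollout automation. The schema below is purely illustrative – it is not from the session or any specific tool – but it captures the shape of the idea: soak time shrinks and wave size grows as confidence builds.

```yaml
# Hypothetical wave definition consumed by a rollout pipeline.
waves:
  - name: wave-1
    clusters: [sandbox-1]
    soakTime: 72h
  - name: wave-2
    clusters: [dev-1, dev-2]
    soakTime: 48h
  - name: wave-3
    clusters: [staging-1, staging-2]
    soakTime: 24h
  - name: wave-4
    clusters: [prod-us-1, prod-eu-1, prod-ap-1]
    soakTime: 12h
```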

Observability: Building Trust in Your Platform

Niall shifted the discussion to observability, emphasizing its critical role in maintaining trust between platform teams and application teams.

"At the end of the day, regardless of whether we have one cluster or 100 clusters, things are going to break", Niall said. "We need a strategy to find issues, detect them, help us remediate them. And if we can't do that as a platform team, then we can't operate a service that our developers can rely on".

In organizations with a platform team/application team structure, the platform team is responsible for the health of the clusters. "For that aspect of trust, we don't want, ideally, our customers to even notice if there's a problem", Niall explained. "If something breaks in some of our clusters, we want to know about it and hopefully fix it before they even know that it's there".

While dashboards are an important starting point, Niall explained that they don't scale effectively when monitoring multiple clusters. "As soon as you start to monitor a handful of clusters, you're not going to be sitting, staring at dashboards waiting for a bomb to happen. You need something that scales a bit more effectively than that".

Instead, platform teams need to implement proactive alerting systems with well-documented runbooks. These runbooks not only help on-call engineers quickly resolve issues but also pave the way for automated remediation – a growing trend Niall observed at recent Kubernetes conferences.
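
As an example of what that proactive alerting might look like, a Prometheus-based stack could carry rules like the following. This is a minimal sketch, assuming kube-state-metrics and the Prometheus Operator's PrometheusRule CRD are installed; the runbook URL is a placeholder:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  groups:
    - name: node-health
      rules:
        - alert: KubeNodeNotReady
          # Fires when a node has reported NotReady for 10 minutes.
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for 10 minutes"
            runbook_url: https://example.com/runbooks/node-not-ready   # placeholder
```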

These observability practices directly support the continuous delivery process Carlos described. When rolling out changes across waves of clusters, monitoring and alerts act as circuit breakers that can automatically stop a rollout if problems are detected.

Niall mentioned the recently launched improved Amazon EKS cluster control plane monitoring, which provides additional metrics for better visibility into cluster health. This feature gives platform teams more data points to monitor their clusters for optimal performance.

Managing Cluster Inventory at Scale

As your Amazon EKS fleet grows across regions and AWS accounts, keeping track of all your clusters becomes increasingly challenging. Niall recommended using developer portals like Backstage to maintain a comprehensive inventory.

"We see people doing it with vendor tools. We see them using Grafana and observability tooling. One thing that we started to see is folks using developer portals like Backstage", Niall said.

These portals can provide a unified view of all clusters with basic information like account ID, region, and Amazon EKS version. You can also integrate cost data from tools like Kubecost, policy violation findings, and historical SLO performance.
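
In Backstage terms, each cluster can be registered as a catalog entity carrying exactly that kind of metadata. A minimal sketch, assuming one Resource entity per cluster; the annotation keys here are made-up examples:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: prod-us-1
  description: Production EKS cluster in us-east-1, account 111122223333
  annotations:
    example.com/eks-version: "1.31"           # hypothetical annotation keys
    example.com/aws-account-id: "111122223333"
spec:
  type: kubernetes-cluster
  owner: platform-team
```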

"We're trying to get rid of that big bookmark folder that you have where, especially if you've got 100 clusters, you're going to have a pretty bad time in the browser trying to keep track of all this stuff", Niall explained.

He also highlighted a recently open-sourced plugin for Backstage that can ingest AWS infrastructure details directly from AWS Config, making it easier to build and maintain this inventory automatically.

Governance: Maintaining Consistency Across Clusters

The final section focused on governance, which becomes crucial as the number of clusters increases. While GitOps provides consistency at the cluster level, you also need guardrails for what developers deploy within the clusters.

Niall explained how tools like Open Policy Agent (OPA), Gatekeeper, and Kyverno can help implement policies as code. These policies can address common challenges that platform teams face:

"I've talked to so many customers where they're Kubernetes team or platform team are unable to upgrade clusters because of what developers have deployed to them", Niall said. Issues like deprecated APIs, problematic Pod Disruption Budgets, or applications deployed with single replicas can complicate cluster upgrades and create availability issues.

For consistency across clusters, Niall recommended deploying a single Helm chart containing all policies to every cluster, with values files to enable or disable specific policies based on cluster type or environment.
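
That pattern boils down to one values file per cluster type. The chart and keys below are hypothetical – they illustrate the enable/disable toggles rather than a published chart:

```yaml
# Hypothetical values for a single chart bundling every policy.
policies:
  requireMultipleReplicas:
    enabled: true
  blockDeprecatedAPIs:
    enabled: true
  requirePodDisruptionBudget:
    enabled: false   # relaxed on development clusters
```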

"Starting to patchwork, deploying one policy to one cluster and one to another, will work early on. But as you start to get up to a larger scale, you're going to need to just start to say, I'm throwing everything everywhere", Niall advised.

As your Amazon EKS fleet grows, you'll also need a centralized way to track policy violations. Solutions like Kyverno's Policy Reporter can aggregate findings into AWS Security Hub, allowing you to search, review, and remediate issues across your entire fleet.
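
Policy Reporter is configured through its Helm values, and a Security Hub target takes roughly the shape below. The exact key names here are assumptions, so verify them against the Policy Reporter documentation:

```yaml
# Sketch of a Policy Reporter Security Hub target; key names are assumptions.
target:
  securityHub:
    accountID: "111122223333"
    region: us-east-1
```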

Bringing It All Together

Building production-grade resilient architectures with Amazon EKS requires attention to management, observability, and governance. These three elements work together to create a platform that can scale to dozens or hundreds of Amazon EKS clusters while maintaining reliability and consistency.

Carlos and Niall created a hands-on workshop covering these topics, available on GitHub. "In the workshop, we have an example of using Terraform. A lot of folks use Terraform. So the pattern shows you how to manage Terraform for the AWS APIs and then Argo CD for the GitOps", Carlos explained.

For those interested in diving deeper, additional resources include the Amazon EKS Workshop for hands-on labs and the Amazon EKS Best Practices Guide, which is now part of the official documentation.

For those interested in watching the full session, including detailed explanations and demonstrations from Carlos Santana and Niall Thomson, the recording is available on the AWS YouTube channel.