
re:Invent 2025 - Future of Kubernetes

10 minute read
Content level: Advanced

This session covered recent enhancements to Amazon EKS and Amazon ECR, introduced new capabilities for large-scale AI/ML workloads, and showcased Netflix's migration journey. The talk explored innovations in control plane architecture, observability, and platform management that make Kubernetes more accessible at any scale.

Speakers:

  • Mike Stefaniak, Head of Product, EKS and ECR, AWS
  • Eswar Bala, Director of Engineering, Containers, AWS
  • Niall Mullen, Senior Director, Cloud Infrastructure, Netflix

Kubernetes has reached a tipping point. According to the latest CNCF survey, 80% of enterprises now run Kubernetes in production, up from 66% the previous year. This growth reflects a fundamental shift in how organizations think about infrastructure management. As Mike Stefaniak explained during the session, Kubernetes succeeds because it wraps 15 years of operational complexity behind simple, declarative APIs that provide consistency across any environment.

AWS responded to this evolution with significant enhancements across Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Registry (Amazon ECR). The core philosophy driving these improvements comes from a conversation Netflix initiated six to nine months before the conference: "We want to use Kubernetes, but we want to make scale and operations somebody else's problem."

Container Registry Innovations

Amazon ECR processes over 2 billion image pulls daily, serving as the foundation for containerized workloads across AWS. Recent enhancements address security, performance, and compliance requirements that emerge at scale.

The integration with Amazon Inspector provides enhanced container image scanning that goes beyond vulnerability detection. The new live inventory feature solves a critical operational challenge: when a scan identifies a vulnerable image, you can now see exactly where that image runs across your infrastructure. For organizations managing hundreds or thousands of clusters, this visibility transforms security response from a manual hunt into an automated inventory process.

ECR archival addresses the compliance burden many organizations face. Regulatory requirements often mandate retaining container images for years, even when those images will never run again. The new archival storage class provides lower-cost storage for these compliance-only images while maintaining the ability to restore them if needed.

The most recent addition, managed image signing, integrates with AWS Signer to automate cryptographic signing without requiring separate infrastructure. Every signing operation flows through AWS CloudTrail, creating an auditable chain of custody for your container images.

Making Kubernetes Upgrades Manageable

Kubernetes upgrades represent one of the most persistent operational challenges in the ecosystem. New versions release every four months, and staying current requires vigilance. AWS has committed to making each new Kubernetes version available in Amazon EKS within 45 days of upstream release, and has maintained this cadence consistently over the past two years.

EKS Cluster Insights scans your clusters daily to identify potential upgrade blockers. The system checks for deprecated APIs, outdated add-ons, and configuration patterns that might break in the next version. When Kubernetes 1.33 introduced changes to Amazon Linux 2 support, Cluster Insights immediately began flagging affected clusters. You can now refresh these insights on demand rather than waiting for the daily scan cycle.
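As a rough illustration of the deprecated-API check described above, the sketch below scans a list of manifests (represented as Python dicts) against a small table of well-known upstream API removals. The table entries are real upstream removals, but the function and record shapes are illustrative, not how Cluster Insights is implemented.

```python
# A few well-known upstream API removals: (apiVersion, kind) -> (replacement,
# Kubernetes version in which the old API is removed). Illustrative subset.
REMOVED_APIS = {
    ("batch/v1beta1", "CronJob"): ("batch/v1", "1.25"),
    ("policy/v1beta1", "PodDisruptionBudget"): ("policy/v1", "1.25"),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): ("autoscaling/v2", "1.26"),
}

def upgrade_blockers(manifests, target_version):
    """Return (name, old apiVersion, replacement) for manifests whose
    apiVersion is removed at or before target_version."""
    target = tuple(map(int, target_version.split(".")))
    blockers = []
    for m in manifests:
        key = (m.get("apiVersion"), m.get("kind"))
        if key in REMOVED_APIS:
            replacement, removed_in = REMOVED_APIS[key]
            if tuple(map(int, removed_in.split("."))) <= target:
                blockers.append((m["metadata"]["name"], key[0], replacement))
    return blockers

manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob",
     "metadata": {"name": "nightly-report"}},
    {"apiVersion": "apps/v1", "kind": "Deployment",
     "metadata": {"name": "web"}},
]
print(upgrade_blockers(manifests, "1.25"))
# flags nightly-report: batch/v1beta1 -> batch/v1
```

Running the same check against a pre-removal target version (say 1.24) returns nothing, which is why the scan is re-run against the next version before each upgrade.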

The EKS global dashboard solves the inventory problem that plagues multi-account, multi-region deployments. AWS built what appears to be the first truly cross-account, cross-region dashboard in AWS, providing a single view of every cluster you operate. This executive-level visibility makes it possible to track upgrade status across your entire Kubernetes footprint without manually checking dozens of accounts.

Observability at the Network Layer

Networking issues account for the majority of Kubernetes troubleshooting tickets AWS receives. The team spent considerable time understanding these failure patterns before launching enhanced container network observability.

A single agent deployed in your cluster exposes metrics about DNS packet limits, retransmission timeouts, and cross-availability zone traffic patterns. The built-in service map visualization shows which pods communicate with each other, making it immediately obvious when a new deployment disrupts established traffic patterns. The system tracks connections to AWS services like Amazon S3 and Amazon DynamoDB, which matters for ML training workloads that frequently access S3 for dataset retrieval.
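The cross-AZ traffic metric above amounts to classifying each flow by the availability zones of its endpoints. A minimal sketch, assuming hypothetical flow records of the form (source pod, destination pod, bytes) and a pod-to-AZ map; neither matches the agent's actual data format:

```python
from collections import defaultdict

# Hypothetical pod-to-AZ placement for illustration.
POD_AZ = {"web-1": "us-east-1a", "api-1": "us-east-1b", "api-2": "us-east-1a"}

def cross_az_bytes(flows):
    """Split flow bytes into same-AZ and cross-AZ totals."""
    totals = defaultdict(int)
    for src, dst, nbytes in flows:
        key = "cross-az" if POD_AZ[src] != POD_AZ[dst] else "same-az"
        totals[key] += nbytes
    return dict(totals)

flows = [("web-1", "api-1", 1000), ("web-1", "api-2", 4000)]
print(cross_az_bytes(flows))  # {'cross-az': 1000, 'same-az': 4000}
```

Surfacing this split per service pair is what makes a new deployment's disruption of established traffic patterns immediately visible.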

The EKS MCP server takes observability further by encoding seven years of EKS operational knowledge into a format that AI assistants can query. When you encounter a pod stuck in CrashLoopBackOff or notice unusual networking behavior, you can ask Amazon Q to analyze the issue. The system accesses the same troubleshooting runbooks that AWS support engineers use, providing guided resolution without opening a support case. The hosted version, launched in preview, integrates directly into the EKS console with full CloudTrail logging and enterprise security controls.

Platform Capabilities Beyond Clusters

EKS capabilities represent AWS's expansion beyond cluster management into the broader platform layer. The initial release focuses on two critical areas: application deployment and infrastructure provisioning.

Managed Argo CD takes the community-standard GitOps tool and handles the operational complexity. The integration with AWS Secrets Manager solves the persistent challenge of managing secrets in GitOps workflows. More significantly, AWS manages the networking synchronization traffic across accounts and regions automatically. When you deploy applications across multiple clusters in different AWS accounts, you don't need to configure VPC peering or transit gateways for Argo to function.

The AWS Controllers for Kubernetes (ACK) and Kubernetes Resource Orchestrator (KRO) capabilities enable infrastructure as code entirely through Kubernetes manifests. Rather than maintaining separate Terraform or CloudFormation templates, developers define S3 buckets, RDS instances, and ElastiCache clusters alongside their application deployments. KRO adds abstraction layers on top of ACK, letting platform teams publish internal APIs that abstract AWS service complexity behind organization-specific interfaces.
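To make "infrastructure as Kubernetes manifests" concrete, here is the general shape of an ACK-style S3 bucket resource, built as a Python dict for illustration. The `s3.services.k8s.aws/v1alpha1` group/version reflects the ACK S3 controller's published CRDs but may differ by controller release; the bucket names are made up.

```python
import json

# Illustrative ACK-style manifest: an S3 bucket declared as a Kubernetes
# custom resource, deployable alongside the application that uses it.
bucket_manifest = {
    "apiVersion": "s3.services.k8s.aws/v1alpha1",
    "kind": "Bucket",
    "metadata": {"name": "team-data"},            # Kubernetes object name
    "spec": {"name": "team-data-prod-artifacts"},  # actual S3 bucket name
}
print(json.dumps(bucket_manifest, indent=2))
```

KRO's contribution is that platform teams can hide such resources behind a simpler, organization-specific custom resource, so application teams never touch the AWS-level fields directly.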

Scaling to Ultra Cluster Dimensions

The AI and ML workload explosion drives unprecedented scale requirements. Eswar Bala detailed how AWS re-architected core components of the EKS control plane to support what they call Ultra Clusters.

Traditional EKS clusters store all state in etcd, using the Raft consensus protocol to maintain consistency across three nodes. This architecture scales well to tens of thousands of pods, but AI training workloads demand clusters with 100,000 nodes and 800,000 GPUs. Three fundamental changes enable this scale:

First, AWS moved etcd's BoltDB storage from network-attached storage to an in-memory tmpfs implementation. This shift delivers order-of-magnitude improvements in both read and write performance.

Second, the team partitioned etcd's key spaces, allowing hot resource types to split across separate etcd clusters. Testing shows this delivers five times the write throughput while maintaining durability guarantees.

Third, and most significantly, AWS replaced Raft-based consensus with the same multi-AZ transaction journal that underpins many AWS services. This eliminates the need for etcd peer-to-peer communication and removes consensus algorithm constraints on replica scaling. The system maintains the same gRPC interface that upstream etcd exposes, preserving compatibility while dramatically improving performance.
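The second change, keyspace partitioning, can be sketched as prefix-based routing: requests for hot resource types go to dedicated backend shards while everything else shares a default. The shard names and prefixes below are illustrative, not EKS's actual layout.

```python
# Hot resource types (events, leases) routed to dedicated etcd shards;
# all other keys fall through to the main shard.
SHARD_BY_PREFIX = {
    "/registry/events/": "shard-events",
    "/registry/leases/": "shard-leases",
}

def shard_for(key, default="shard-main"):
    """Pick the backend shard for an etcd key by longest-known prefix."""
    for prefix, shard in SHARD_BY_PREFIX.items():
        if key.startswith(prefix):
            return shard
    return default

print(shard_for("/registry/events/default/pod-123"))  # shard-events
print(shard_for("/registry/pods/default/web-0"))      # shard-main
```

Because the routing happens behind the same gRPC interface that upstream etcd exposes, the API server is unaware that writes for different resource types land on different stores.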

The data plane received parallel improvements. Multi-network interface support enables pods to achieve 100 Gbps network bandwidth. Concurrent image download and unpacking using SOCI image pull technology cuts container startup time in half. Prefix delegation assigns CIDR ranges to instances rather than individual pod IPs, improving node launch rates threefold while optimizing VPC address space utilization.
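The prefix delegation gain is straightforward address arithmetic: delegating a /28 prefix (16 addresses) per API call replaces one call per pod IP. A small sketch with Python's `ipaddress` module, using an illustrative /20 subnet and pod count:

```python
import ipaddress

# Illustrative subnet; a real cluster's VPC layout will differ.
subnet = ipaddress.ip_network("10.0.0.0/20")

# Carve /28 prefixes (16 addresses each) out of the subnet.
prefixes = list(subnet.subnets(new_prefix=28))
ips_per_prefix = prefixes[0].num_addresses  # 16

# One delegation call provisions a whole prefix of pod IPs, so preparing
# a node for the same pod count takes far fewer EC2 API calls.
pods_needed = 110
prefix_calls = -(-pods_needed // ips_per_prefix)  # ceiling division
per_ip_calls = pods_needed
print(prefix_calls, per_ip_calls)  # 7 110
```

Fewer per-node API calls during launch is where the threefold node launch-rate improvement comes from, and carving contiguous prefixes keeps VPC address space utilization dense.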

Provisioned Control Planes for Predictable Performance

While Ultra Clusters target the highest end of the scale spectrum, AWS recognized that all customers benefit from predictable control plane performance. The new provisioned control plane option lets you select specific performance tiers with pre-allocated capacity.

Standard mode, which remains the default, scales the control plane dynamically based on workload demands. This works well for general-purpose workloads but introduces potential latency during scaling events. Provisioned mode eliminates this uncertainty by reserving specific capacity levels.

The tiers scale from handling 1,000 concurrent API requests to 6,800 requests at the highest level. Each tier maintains 16 GB of cluster database capacity, which testing shows covers most workload patterns. You pay for the reserved capacity but gain guaranteed performance during critical operations like rapid scaling events or cluster upgrades. You can switch between standard and provisioned modes at any time, adjusting capacity tiers as workload requirements evolve.
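Choosing a tier reduces to picking the smallest reserved capacity that covers your peak concurrent API request rate. The endpoints (1,000 and 6,800) are the figures quoted in the session; the intermediate tier values below are invented for illustration.

```python
# Tier endpoints are from the talk; intermediate values are hypothetical.
TIERS = [1000, 2000, 4000, 6800]

def smallest_tier(peak_concurrent_requests):
    """Smallest provisioned tier covering the observed peak."""
    for tier in TIERS:
        if tier >= peak_concurrent_requests:
            return tier
    raise ValueError("peak exceeds the largest provisioned tier")

print(smallest_tier(1500))  # 2000
```

Since modes can be switched at any time, a reasonable pattern is to run standard mode, measure peaks, and move to provisioned mode only for clusters where scaling latency has bitten.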

Netflix's Migration Journey

Niall Mullen shared how Netflix migrated its massive Titus container platform to EKS over approximately 11 weeks. Netflix operates at a scale that strains most infrastructure platforms: 300 million paying subscribers, personalization systems processing billions of predictions, and continuous video encoding that re-renders the entire content catalog using the latest codecs.

Netflix runs fewer than 20 production clusters across four core regions, with each cluster containing up to 10,000 large instances (primarily 24xlarge and 48xlarge sizes) and 80,000 pods. The regional availability model means Netflix can evacuate an entire AWS region in five minutes when necessary. During an AWS outage in October, Netflix shifted out of US-East-1 in 15 minutes.

The most demanding scenario combines steady-state operations with disaster recovery during peak traffic. When 100 million people watched a live sporting event and AWS experienced an issue in US-East-1 simultaneously, Netflix needed to launch 70,000 containers in five minutes. Three years ago, when Netflix first approached AWS about EKS adoption, the answer was "no way" to supporting these launch rates. Progress by Pinterest through 2023 brought EKS scaling to a point where Netflix only needed a doubling of capacity, making migration feasible.
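The launch-rate target in that scenario is worth making explicit; the figures below are the ones quoted in the talk.

```python
# 70,000 containers in a five-minute evacuation window works out to a
# sustained launch rate of roughly 233 containers per second.
containers = 70_000
window_seconds = 5 * 60
rate = containers / window_seconds
print(round(rate))  # 233
```

Sustaining hundreds of launches per second, on top of steady-state churn, is the bar the control plane scaling work had to clear.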

Netflix spent nine months working with the EKS team on scale improvements and integrations. The team consolidated their regional control planes into dedicated accounts, integrated Netflix's identity model with IAM for control plane access, and established CloudWatch and Prometheus monitoring integrations. They then migrated the entire fleet in a single quarter.

Netflix has since migrated their federation layer (which routes pods to appropriate clusters) to etcd and is now replacing their custom Virtual Kubelet implementation with the standard Kubelet. This positions them to potentially adopt EKS Hybrid Nodes for edge use cases and EKS Auto Mode to eliminate OS management overhead.

The Three-Year Roadmap

AWS outlined priorities for the next three years across five dimensions. "Critical workload patterns at any scale" means supporting not just larger individual clusters but also easier management of workloads distributed across multiple clusters. AWS integrations continue to expand, recognizing that many customers access AWS services primarily through Kubernetes rather than directly through AWS APIs.

"Meeting workloads where they are" spans the spectrum from EKS Distro (take Kubernetes anywhere, including fighter jets) to fully managed EKS in the cloud, with EKS Anywhere and EKS Hybrid Nodes covering on-premises deployments. The team continues improving the AWS Outposts story to support new SKUs and server types.

"Simplifying platform building" reflects the core philosophy that drove the session: you should be able to use Kubernetes without operating it. The launch of managed Argo CD exemplifies this approach. AWS takes community standards and manages them rather than building proprietary alternatives. Karpenter represents a case where AWS determined existing solutions (cluster autoscaler) could be fundamentally improved, and the resulting project became a new cross-cloud standard.

The ultimate vision suggests a future where you hand AWS an application manifest and don't think about clusters at all. AWS handles the federated layer above clusters, determining optimal placement across your infrastructure automatically.

Key Takeaways

Kubernetes has evolved from a container orchestrator into the substrate for AI, ML, web applications, data processing, and stateful workloads. AWS focuses on reducing the operational burden while maintaining full conformance with upstream Kubernetes. Every enhancement ships vanilla Kubernetes with no proprietary modifications.

The innovations in control plane architecture that enable Ultra Clusters benefit all Amazon EKS users through improved performance and reliability. Provisioned control planes give you predictable performance when you need guaranteed capacity. Enhanced observability, particularly for networking, provides the visibility needed to troubleshoot issues without opening support cases.

EKS capabilities mark a shift from managing clusters to managing platforms. As developer familiarity with Kubernetes decreases (counterintuitively indicating success), Kubernetes becomes infrastructure that developers use without thinking about it, similar to how Linux operates today as a largely transparent layer in the stack.

Organizations can adopt Kubernetes without becoming pure technology companies. The platform handles complexity while providing the consistency, extensibility, and simplicity that made Kubernetes the standard for container orchestration.


The full session recording is available at: Future of Kubernetes - AWS re:Invent 2025