re:Invent 2025 - Under the hood: Architecting Amazon EKS for scale and performance
AI and machine learning workloads have made 100,000-node Kubernetes clusters a practical requirement, not a theoretical limit. This session goes inside the architectural changes AWS made to etcd, the data plane, and the control plane to support that scale, and explains how those same innovations are now available to every EKS customer through a new tiered control plane feature.
Running large-scale AI training and inference on Kubernetes exposes the hard limits of what the control plane was originally designed to handle. At tens of thousands of coordinated instances working as a single system, every bottleneck in the data store, scheduler, and networking stack becomes a constraint on the training job itself. In session CNS429 at AWS re:Invent 2025, Sheetal Joshi, Principal WW Specialist SA for Containers at AWS, and Raghav Tripathi, Engineering Leader at Amazon EKS, walked through the specific architectural rethinks that enabled Amazon EKS Ultra-Scale Clusters and the newly launched Provisioned Control Plane feature. Nova DasSarma, Member of Technical Staff and infrastructure lead at Anthropic, joined to share how her team operates more than 99% of Anthropic's compute on EKS at a scale that has directly shaped both features. In this post, we'll explore what changed architecturally to reach 100,000 nodes in a single cluster and what it means for your own EKS workloads.
Why 100,000-node clusters and why a single cluster matters
The case for very large single clusters comes down to how AI training jobs are structured. A single training job at Anthropic's scale can occupy a single namespace or a single StatefulSet, spanning thousands of accelerated instances that need to act as one coordinated system. Splitting that across multiple clusters would require coordinating across cluster boundaries, adding the very complexity that Kubernetes was designed to eliminate. As Nova DasSarma put it, putting a single application across multiple clusters means building a Kubernetes on top of another Kubernetes. The framework support simply doesn't exist to manage those workloads across cluster boundaries in a practical way.
The economics of multi-tenant GPU infrastructure reinforce this. Every idle instance or slow scale-up represents utilization that could go toward another training run. A single logical resource pool, even when it spans availability zones, makes it possible to share capacity across teams and workloads without running separate clusters per team. The number of GPU-powered instances running on Amazon EKS has doubled in the past year, and according to Gartner, 95% of new AI deployments will use Kubernetes by 2028. The demand is already here.
Rebuilding etcd and the data plane from the ground up
EKS Ultra-Scale Clusters support up to 100,000 nodes in a single cluster, which at full capacity represents 800,000 NVIDIA GPUs or approximately 1.6 million AWS Trainium chips. Getting there required rethinking the three core bottlenecks in the standard Kubernetes architecture.
In a standard EKS cluster, etcd uses the Raft consensus algorithm across a three-node cluster to maintain consistency. Every write requires Raft coordination before it is committed, which works well at typical scale but becomes a throughput ceiling as the cluster grows. The first change was offloading consensus from etcd to a purpose-built multi-AZ transaction journal within AWS. With a specialized system handling durability and consensus, etcd is no longer bound by quorum requirements and can scale horizontally rather than being locked into three-, five-, or seven-node configurations.
The second change followed directly from the first. With durability handled externally, the team moved cluster state from Amazon EBS volumes backed by BoltDB to an in-memory database using tmpfs. Reading and writing from memory instead of disk produces an order-of-magnitude improvement in throughput and allowed the team to push the etcd database size limit for ultra-scale clusters to 20 GB, 2.5 times the standard EKS limit. The third change was partitioning the highest-traffic key types (nodes, pods, leases, and events) into separate dedicated etcd instances rather than sharing a single key-value store, allowing each high-traffic path to scale independently.
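The partitioning idea can be pictured as a simple router: reads and writes for high-churn resource types go to dedicated backends, and everything else shares a default store. The following Python sketch is purely conceptual (the class, key layout handling, and backend names are illustrative, not the actual EKS implementation):

```python
# Conceptual sketch of per-resource-type key-space partitioning.
# The real EKS routing is internal to AWS; this only illustrates the idea.

HIGH_TRAFFIC_TYPES = {"nodes", "pods", "leases", "events"}

class PartitionedStore:
    def __init__(self):
        # One dedicated store per high-traffic resource type,
        # plus a shared store for all other object types.
        self.stores = {t: {} for t in HIGH_TRAFFIC_TYPES}
        self.stores["default"] = {}

    def _route(self, key: str) -> dict:
        # etcd keys follow the pattern /registry/<type>/<namespace>/<name>
        resource_type = key.split("/")[2]
        return self.stores.get(resource_type, self.stores["default"])

    def put(self, key: str, value: str) -> None:
        self._route(key)[key] = value

    def get(self, key: str):
        return self._route(key).get(key)

store = PartitionedStore()
store.put("/registry/pods/default/trainer-0", "Running")
store.put("/registry/configmaps/default/app-config", "v1")

# Pod traffic lands in its own partition, isolated from low-churn types.
assert "/registry/pods/default/trainer-0" in store.stores["pods"]
assert "/registry/configmaps/default/app-config" in store.stores["default"]
```

The payoff is that a burst of pod or lease writes during a large scale-up contends only with its own partition rather than with every other object type in the cluster.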
The combined results: peak read throughput of 7,500 requests per second, peak write throughput of 8,000 to 9,000 requests per second, P99 read and write latency between 100 milliseconds and 1 second, and list request response times of 5 to 20 seconds, well within the 30-second upstream service-level objective. These measurements were taken while running mixed workloads including a 100,000-node pre-training job using StatefulSets, parallel fine-tuning jobs with 10,000 nodes each, and real-time inference on Llama 3.2, all in a single cluster simultaneously.
The data plane received three parallel optimizations. AI and machine learning container images often exceed 5 GB and in Anthropic's case reach 35 GB, so image pull time directly determines how quickly a failed node can be replaced during a training run. Using the AWS SOCI (Seekable OCI) snapshotter to parallelize container download and unpacking, combined with a change to the Amazon VPC CNI plugin that gives a single pod access to all network cards on the instance (up to 100 Gbps of bandwidth), reduced the time from pod scheduling to running with all required data by 3x. Nova noted this reduced P50 parallel pull time by approximately five minutes across thousands of servers, which is often the determining factor in mean time to recovery after a node failure. Karpenter was also updated to pre-assign IP prefixes to pods during node launch rather than reactively through the CNI afterward, reducing node readiness time at a scale where each second saved compounds across thousands of concurrent launches.
Provisioned Control Plane and what Anthropic learned
Not every workload needs 100,000 nodes, but many workloads need more predictable control plane behavior than reactive auto-scaling provides. Provisioned Control Plane takes the architectural work behind ultra-scale and makes it available in a tiered form to all EKS customers.
Standard Control Plane continues to auto-scale reactively. When the cluster detects increased demand such as a surge in API traffic or a large pod deployment, it scales up in approximately 10 minutes. This remains the right choice for workloads with variable or unpredictable demand. Provisioned Control Plane introduces three additional tiers (XL, 2XL, and 4XL) for customers who need predictable capacity before a demand event rather than after. Each provisioned tier provides higher API request concurrency, a faster pod scheduling rate, and an etcd database size of up to 16 GB (double the standard). You can move between tiers, including back to standard, with a single API call, no migration, no downtime, and no new cluster required. Provisioned Control Plane is available on Kubernetes 1.28 and later, and EKS provides new real-time metrics for API request concurrency, pod scheduling rate, and database utilization to help you monitor and adjust tier selection over time.
Three new metrics make it straightforward to decide when a tier change is appropriate. If API request concurrency is consistently near the current tier's ceiling (indicating controllers and operators are competing for capacity), a higher tier reduces latency across the entire control loop. If pod scheduling rate is limiting how quickly large batches of pods (such as Spark jobs or training replicas) become ready, a tier with a higher scheduling rate directly reduces job start time. If etcd database size is approaching the limit, it signals a need to either clean up objects, review architectural patterns, or move to a higher tier before the limit affects operations.
Nova DasSarma's observations from Anthropic's production environment add practical texture to these features. For scheduling, Anthropic uses the default Kubernetes scheduler for most CPU workloads and for DaemonSets across tens of thousands of nodes. For large training jobs, where the relevant metric is scheduling rate per workload rather than per pod, Anthropic built an internal scheduler called Cartographer (which the team is considering open sourcing) that schedules an entire workload with full topology awareness in a single pass, co-locating pods within the same network domain to improve training performance by a factor of 2 to 3x. For storage, Anthropic uses Amazon S3 as the primary data store, with a prefetch buffer to absorb object storage latency, currently at 5,000 GB/s throughput without a parallel file system for most workloads. For DNS at this scale, reducing CoreDNS replica counts and scaling vertically improves cache hit rates and reduces upstream DNS load more effectively than running hundreds of replicas.
Key takeaways
EKS Ultra-Scale Clusters are built on three foundational changes to etcd (external consensus via a purpose-built transaction journal, in-memory cluster state, and partitioned key spaces) and three data plane changes (parallel image pull, full instance network bandwidth per pod, and pre-assigned IP prefixes at node launch) that together produce 3x faster pod readiness and control plane throughput previously unavailable in a single Kubernetes cluster.
Provisioned Control Plane makes the performance characteristics of that architecture available without requiring a 100,000-node workload to justify it. If you have large-scale AI training or inference, high-frequency deployments, or upcoming events where control plane latency directly affects your results, selecting the right tier proactively gives you predictable capacity instead of relying on reactive scaling to catch up. The same architecture that powers Anthropic's largest training runs is now one API parameter away.
Watch the full session recording: AWS re:Invent 2025 - Under the hood: Architecting Amazon EKS for scale and performance (CNS429)