re:Invent 2025 - Generative and Agentic AI on Amazon EKS

7 minute read
Content level: Advanced

AI teams that choose Kubernetes gain infrastructure control, workload portability, and a single cluster for business and AI workloads. The challenge is knowing how to get from that choice to production. This session provides a practical path from deploying your first agent to managing GPU fleets at scale.

Christina Andonov, Senior Specialist Solutions Architect at AWS, and Chris Splinter, Principal Product Manager on the Amazon EKS product team, walked through the full journey of running AI workloads on Kubernetes at re:Invent 2025. In this post, we'll cover how to build and deploy AI agents on Amazon EKS, how to size and run GPU inference and fine-tuning workloads, and what recent EKS launches mean for teams operating these workloads today.

Customers choose Kubernetes for AI for three consistent reasons: precise control over the underlying infrastructure to tune cost-performance ratios, portability across clouds and on-premises environments, and a single cluster that hosts business applications and AI workloads side by side. That combination has produced millions of GPU-powered EC2 instances running in EKS clusters every week, a number that more than doubled between 2024 and 2025. Gartner predicts that by 2028, 95% of new AI workloads will run on Kubernetes, up from less than 30% today.

Building and running AI agents on EKS

AI agents handle problems that require reasoning, which is a fundamentally different pattern from traditional business software built for deterministic behavior. The good news for Kubernetes teams is that deploying an agent on Amazon EKS follows the same patterns as deploying other containerized services.

The foundation is an agentic framework, a Python library that manages how your agent communicates with large language models (LLMs) and invokes external capabilities. The session used Strands Agents, an open-source framework AWS released in May 2025. With Strands, you define an agent in five lines of Python, containerize it, push it to a registry, and deploy to Amazon EKS using the same pipeline you use for other workloads.

To give an agent access to real-time data such as a live weather API, you add a tool. A tool is a regular Python function decorated with @tool, and the agent decides at runtime whether to call it. When you build multiple agents that share the same APIs or databases, wrapping those calls in a Model Context Protocol (MCP) server consolidates the integration layer and removes duplication. Most agentic frameworks connect to MCP servers natively.
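The tool pattern described above can be pictured with a small stand-in. This is a minimal sketch in plain Python, not Strands' implementation: the `TOOLS` registry, the `dispatch` helper, and the stubbed `get_weather` function are all hypothetical names introduced for illustration, and a real framework supplies its own `@tool` decorator and lets the model choose which tool to call.

```python
from typing import Callable, Dict

# Hypothetical registry for illustration only; agentic frameworks
# ship their own @tool decorator and bookkeeping.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(func: Callable[..., str]) -> Callable[..., str]:
    """Register a plain function so an agent loop can look it up by name."""
    TOOLS[func.__name__] = func
    return func


@tool
def get_weather(city: str) -> str:
    """Stubbed tool; a real one would call a live weather API."""
    return f"Weather for {city}: unavailable in this sketch"


def dispatch(tool_name: str, **kwargs: str) -> str:
    """Run the tool the model asked for, or report that it does not exist."""
    if tool_name not in TOOLS:
        return f"unknown tool: {tool_name}"
    return TOOLS[tool_name](**kwargs)


print(dispatch("get_weather", city="Seattle"))
```

The point of the sketch is that a tool stays an ordinary function: the decorator only records it, and the decision to call it happens at runtime.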

Completing the production setup requires authentication, memory, and observability. Amazon Cognito handles authentication. Amazon DynamoDB and Amazon S3 support short-term and long-term agent memory through Strands Session Manager. Logs, metrics, and traces each serve a role, with traces deserving extra attention since the non-deterministic behavior of LLMs means you need a clear record of the path taken to reach a response. Ragas evaluates response quality, and Langfuse provides full trace visualization with latency metrics.

Running GPU inference and fine-tuning on EKS

When you run the model on Amazon EKS rather than calling an external API, hardware selection is the first decision. A practical sizing formula starts with the model size in gigabytes from its Hugging Face model card, adds a few gigabytes for KV cache and token generation memory, then pads by one to two gigabytes. A 40 GB model works out to roughly 45 GB of required GPU memory, which fits on a G6 instance with 48 GB of GPU memory. Quantization techniques such as QLoRA can cut that requirement roughly in half, making smaller instance families like G5 (24 GB GPU memory) viable.
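The sizing rule above is simple arithmetic, so it can be written as a small helper. This is a rough sketch under stated assumptions: the default of 4 GB for KV cache stands in for "a few gigabytes", the 1 GB padding is the low end of the stated range, and the `reduction` factor is applied to the whole estimate to mirror the "roughly in half" claim. The function names are hypothetical.

```python
def estimate_gpu_memory_gb(
    model_size_gb: float,
    kv_cache_gb: float = 4.0,   # assumption: "a few gigabytes"
    padding_gb: float = 1.0,    # assumption: low end of "one to two gigabytes"
    reduction: float = 1.0,     # e.g. 0.5 to model a roughly halved requirement
) -> float:
    """Estimate required GPU memory from the model size on its model card."""
    return (model_size_gb + kv_cache_gb + padding_gb) * reduction


def fits(required_gb: float, gpu_memory_gb: float) -> bool:
    """True when the estimate fits within the instance's GPU memory."""
    return required_gb <= gpu_memory_gb


full = estimate_gpu_memory_gb(40)                      # 40 + 4 + 1 = 45 GB
quantized = estimate_gpu_memory_gb(40, reduction=0.5)  # 45 * 0.5 = 22.5 GB
print(full, fits(full, 48))
print(quantized, fits(quantized, 24))
```

Treat the result as a starting point for instance selection; actual KV cache usage depends on context length and batch size.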

For capacity, the On-Demand, Savings Plans, and Spot purchasing options all work with GPU instances. On-Demand Capacity Reservations (ODCRs) let you lock in capacity for production workloads without a long-term commitment. Capacity Blocks provide prepaid GPU capacity in increments from 24 hours to 28 days, suited to batch fine-tuning jobs with a defined runtime.

Provisioning the GPU node into the cluster requires a Karpenter node pool and an EKS-optimized accelerated Amazon Machine Image (AMI). EKS Auto Mode ships with GPU support out of the box via a Bottlerocket-based AMI. For open-source Karpenter, you can use Bottlerocket or AL2023; with AL2023, you install the NVIDIA device plugin separately. Baking GPU drivers and kernel modules into the AMI removes runtime installation and cuts cold-start time, which matters because the target from instance launch to model serving is under two minutes. That window matches the Spot interruption notice period and keeps scaling responsive to variable inference traffic.
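As an illustration of what a GPU node pool might contain, the sketch below builds a NodePool manifest for open-source Karpenter as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The pool name, the `default` EC2NodeClass, the instance families, and the GPU limit are assumptions for illustration; EKS Auto Mode uses its own node class and label names, so this shape should be checked against the Karpenter or Auto Mode documentation for your version.

```python
import json
from typing import Dict, List


def gpu_node_pool(
    name: str,
    instance_families: List[str],
    node_class_name: str = "default",  # assumption: an EC2NodeClass with this name exists
    gpu_limit: int = 8,                # assumption: cap on total GPUs for the pool
) -> Dict:
    """Build a Karpenter v1 NodePool manifest restricted to GPU instance families."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class_name,
                    },
                    "requirements": [
                        {
                            "key": "karpenter.k8s.aws/instance-family",
                            "operator": "In",
                            "values": list(instance_families),
                        },
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["on-demand"],
                        },
                    ],
                    # Keep non-GPU pods off these nodes unless they tolerate the taint.
                    "taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
                }
            },
            "limits": {"nvidia.com/gpu": gpu_limit},
        },
    }


manifest = gpu_node_pool("gpu-inference", ["g6", "g5"])
print(json.dumps(manifest, indent=2))
```

The AMI choice (Bottlerocket or AL2023) lives in the node class rather than the node pool, which is why it does not appear here.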

Scaling inference workloads requires a custom metric since Kubernetes' Horizontal Pod Autoscaler (HPA) has no native GPU metric. The practical path is to pull a metric from your inference framework (common choices include vLLM, Ray, and NVIDIA Dynamo) and feed it into HPA via KEDA (Kubernetes Event-driven Autoscaling). On the health side, EKS Auto Mode ships with node health monitoring and auto-repair configured by default, restarting or replacing an affected node within ten minutes of detecting a hardware failure.
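Whatever metric you feed in, HPA applies the same documented rule: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). A minimal sketch of that calculation follows; the metric here (pending requests per replica) and the replica bounds are illustrative assumptions, and the real controller also handles stabilization windows and missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Illustrative metric: pending requests per replica, with a target of 10.
print(desired_replicas(2, current_metric=30, target_metric=10))    # scale out
print(desired_replicas(4, current_metric=10.5, target_metric=10))  # within tolerance
print(desired_replicas(4, current_metric=2, target_metric=10))     # scale in
```

Choosing a metric that tracks queueing or saturation in the inference server tends to matter more than tuning the formula itself.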

Recent Amazon EKS launches for AI workloads

Amazon EKS added support for GB200 (launched in 2025) and GB300 (announced at re:Invent 2025). These NVIDIA Grace Blackwell GPUs target the largest training and inference workloads and support multi-node GPU-to-GPU communication via Elastic Fabric Adapter (EFA) and NVLink. The P6 family (B200 and B300 variants) delivers up to two times the performance of the previous P5 generation. For smaller-scale use cases, the single-GPU P5 4xlarge (NVIDIA H100) and fractional G6f instances are supported, with Amazon EKS AMI validation covering the full stack of GPU drivers, kernel modules, and software packages for each.

Fast container pulls via SOCI (Seekable OCI) reduce startup time for large inference containers. PyTorch, vLLM, and similar frameworks produce large images that traditionally take several minutes to pull to a node. SOCI introduces a Parallel Pull and Unpack mode that runs image download and disk unpack concurrently, cutting startup time with no changes to your build process. It is enabled by default in EKS Auto Mode for GPU and Trainium instance families.
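To see why overlapping download and unpack helps, consider a toy timing model. This is not how SOCI is implemented; it only compares finishing every download and unpack one after another against a two-stage pipeline where a layer is unpacked while the next one downloads. The per-layer timings are made-up numbers.

```python
from typing import List, Tuple

Layer = Tuple[float, float]  # (download_seconds, unpack_seconds), hypothetical values


def sequential_seconds(layers: List[Layer]) -> float:
    """Total time when each layer is downloaded, then unpacked, one step at a time."""
    return sum(download + unpack for download, unpack in layers)


def overlapped_seconds(layers: List[Layer]) -> float:
    """Total time when unpacking a layer overlaps with downloading the next one."""
    download_done = 0.0
    unpack_done = 0.0
    for download, unpack in layers:
        download_done += download
        # Unpack starts once the layer has arrived and the previous unpack finished.
        unpack_done = max(unpack_done, download_done) + unpack
    return unpack_done


layers = [(30.0, 20.0), (40.0, 25.0), (20.0, 15.0)]
print(sequential_seconds(layers))
print(overlapped_seconds(layers))
```

In this model the saving grows with the number of layers and with how evenly download and unpack times are balanced.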

Karpenter and Auto Mode received three additions for AI workload patterns. Capacity reservation support lets you specify an ODCR or ML Capacity Block ID directly in your node pool definition. Static capacity provisioning lets you define a fixed baseline of nodes that Karpenter provisions without waiting for pending pods, removing the need for balloon pods. Node Overlay (currently Karpenter only, with Auto Mode support planned) lets you pass custom pricing or resource attributes that Karpenter uses during instance selection, useful when you have negotiated pricing or require specific huge page configurations.

For large-scale clusters, the new EKS Provisioned Control Plane lets you pre-scale before a demand spike such as a product launch or a peak traffic event. The AWS Application Load Balancer Target Optimizer switches ALB from a push model to a pull model, where an agent on each node signals availability based on a configurable max concurrent request limit. For inference workloads, which have lower concurrency than typical web services, this drives higher GPU utilization and reduces error rates under heavy load. The hosted EKS MCP server (currently in preview) gives AI agents direct access to live cluster information including pod logs, Kubernetes events, and Amazon CloudWatch metrics, with IAM credentials passing through a local SigV4 proxy to Kubernetes RBAC.
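The pull model described for the ALB Target Optimizer can be pictured with a small simulation. This is a conceptual sketch only, not the Target Optimizer's actual protocol: the `Target` class, the `route` function, and the target names are hypothetical, and the only idea carried over from the session is that a target accepts work only while it is below its max concurrent request limit.

```python
from typing import List, Optional


class Target:
    """A backend that signals availability only while below its concurrency limit."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    def has_capacity(self) -> bool:
        return self.in_flight < self.max_concurrent

    def finish_request(self) -> None:
        """Mark one in-flight request as completed."""
        if self.in_flight > 0:
            self.in_flight -= 1


def route(targets: List[Target]) -> Optional[str]:
    """Send a request to the first target signalling capacity, or None if all are busy."""
    for target in targets:
        if target.has_capacity():
            target.in_flight += 1
            return target.name
    return None


# Two hypothetical GPU-backed targets, each serving one request at a time.
gpu_a = Target("gpu-a", max_concurrent=1)
gpu_b = Target("gpu-b", max_concurrent=1)
targets = [gpu_a, gpu_b]

first = route(targets)    # goes to gpu-a
second = route(targets)   # gpu-a is full, goes to gpu-b
third = route(targets)    # both full, nothing available
gpu_a.finish_request()
fourth = route(targets)   # gpu-a has capacity again
print(first, second, third, fourth)
```

With a low limit per target, requests never pile up behind a busy GPU, which is the behavior the session credits for higher utilization and fewer errors under load.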

Running AI workloads on Amazon EKS does not require building a new foundation. The same CI/CD pipelines, observability stacks, and autoscaling primitives that work for your current workloads extend directly to agents, inference, and fine-tuning. The new additions (agentic frameworks, MCP servers, GPU-aware node pools, and inference-specific scaling metrics) fit into patterns that Kubernetes teams already know. The Amazon EKS team runs monthly virtual workshops on inference and agentic AI and refreshes the content for each one. The EKS AI/ML user guide and Amazon EKS Blueprints for Terraform provide ready-to-deploy cluster configurations as a starting point.

Watch the full session: AWS re:Invent 2025 - Generative and Agentic AI on Amazon EKS (CNS344)