re:Invent 2025 - Networking and observability strategies for Kubernetes

8 minute read
Content level: Advanced

Running Kubernetes at scale means managing two overlapping network planes: the VPC and the Kubernetes network layer. Without visibility across both, teams cycle between overly permissive and overly restrictive security postures, slowing down application delivery and leaving the network environment poorly understood. Session CNS417 shows how Amazon EKS Auto Mode, native network policies, and the newly launched Container Network Observability feature work together to close that visibility gap, so teams can adopt fine-grained network policies without breaking their applications.

When you run dozens or hundreds of microservices on Amazon EKS, understanding who talks to whom is rarely straightforward. East-west traffic within the cluster, pod-to-AWS-service communication, and connections to external systems all layer on top of each other, creating a network environment that is difficult to reason about confidently. In session CNS417 at AWS re:Invent 2025, Rodrigo Bersa, Senior WW Containers Specialist Solutions Architect at AWS, and Lukonde "Luke" Mwila, Senior Amazon EKS Product Manager at AWS, walked through a layered approach to network operations: simplifying the infrastructure baseline with EKS Auto Mode, applying fine-grained access control with native Kubernetes network policies, and gaining the observability needed to make both work correctly. In this post, we'll walk through those three layers and show how they connect into a practical workflow for improving your EKS network security posture.

The challenge: two network planes, limited visibility

Most EKS environments share a recognizable pattern: a cluster hosting multiple microservices that communicate with each other and with AWS services like Amazon S3, Amazon DynamoDB, and Amazon ElastiCache, while also integrating with systems running on-premises or on the internet. As the environment grows, the network becomes the critical layer tying all of it together.

The challenge is that Kubernetes introduces its own network plane on top of the VPC, giving you two distinct layers to reason about simultaneously. Platform and security teams need to understand traffic behavior across both, but without sufficient data, conclusions become imprecise and decision-making slows down. In practice, many teams end up in one of two places: an overly permissive environment where pods communicate freely because no one has enough visibility to write accurate policies, or an overly restrictive environment where applying default-deny policies breaks applications because the communication paths were never fully understood upfront.

As the session demonstrated with a live sample e-commerce application, applying a default-deny network policy without first understanding actual traffic patterns causes pods to fail liveness and readiness probes, taking the application offline. The instinctive response is to revert, which lands teams back in the permissive state indefinitely. Breaking this cycle requires a third element: visibility.
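A default-deny posture like the one demonstrated is typically expressed as a NetworkPolicy that selects every pod in a namespace. A minimal sketch (the namespace name is illustrative, not from the session):

```yaml
# Illustrative default-deny policy: the empty podSelector matches every
# pod in the namespace, and declaring both policy types with no rules
# blocks all ingress and egress not explicitly allowed elsewhere.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ecommerce   # hypothetical namespace for the demo app
spec:
  podSelector: {}        # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
```

Applied without accompanying allow rules for health checks, DNS, and service-to-service paths, a policy like this is exactly what produces the probe failures described above.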

Building the foundation: EKS Auto Mode and network policies

EKS Auto Mode provides the operational baseline that makes network policy adoption more practical at scale. Rather than requiring manual setup of networking components, Auto Mode embeds CoreDNS, kube-proxy, and the Amazon VPC CNI plugin directly into the cluster and keeps them patched and updated on the AWS side. It uses Karpenter for node lifecycle management, including rolling out configuration changes to node classes in a controlled way, limiting disruption to a configurable percentage of nodes at a time.

For network policies specifically, Auto Mode simplifies the setup considerably. Network policy support is enabled by default through the VPC CNI, which uses eBPF (Extended Berkeley Packet Filter) to evaluate policies at the kernel level without requiring a separate CNI plugin. eBPF inspects system calls at the kernel boundary, giving the policy enforcement engine a performant path for evaluating access rules without the overhead of a sidecar-based approach. This is the same kernel mechanism used for load balancing, security enforcement, and tracing in other contexts.

The default behavior for each node class is configurable. You can set it to default-allow for development or staging environments where teams are still mapping traffic flows, then change to default-deny once policies are in place. Auto Mode handles the rollout, spinning up new nodes with the updated configuration while constraining disruption to the percentage you define in the node pool configuration. In a large environment, this means you can roll out security policy changes progressively rather than all at once.
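Both the per-node-class default and the disruption ceiling are declarative settings. A sketch assuming the EKS Auto Mode NodeClass `networkPolicy` field and a Karpenter NodePool disruption budget (field names follow the Auto Mode and Karpenter APIs; names and values here are illustrative):

```yaml
# EKS Auto Mode NodeClass: nodes from this class start in default-allow.
# Changing the value to DefaultDeny triggers a controlled node rollout.
apiVersion: eks.amazonaws.com/v1
kind: NodeClass
metadata:
  name: app-nodes                    # hypothetical node class name
spec:
  networkPolicy: DefaultAllow        # switch to DefaultDeny once policies exist
  networkPolicyEventLogs: Enabled    # log policy decisions for auditing
---
# Karpenter NodePool: cap disruption so a configuration rollout
# replaces at most 10% of nodes at a time.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
      - nodes: "10%"
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: app-nodes
```

The budget is what lets a default-deny cutover proceed progressively across a large fleet instead of recycling every node at once.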

Container Network Observability: the missing piece

The visibility gap is what Container Network Observability in Amazon EKS directly addresses. The feature works through an agent deployed on each worker node as part of Amazon CloudWatch Network Flow Monitoring. The agent uses eBPF to capture the top 500 network flows per worker node, along with flow-level metrics including data transferred, retransmissions, and retransmission timeouts. These metrics are available in two forms: system-level metrics that can be scraped directly from the agent in OpenMetrics format, and flow-level metrics accessible through the EKS console.
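Because the agent exposes system-level metrics in OpenMetrics format, they can be collected with a standard Prometheus scrape job. A sketch, assuming the agent runs as a DaemonSet whose pods carry the label shown here (the label value is an assumption; check the feature documentation for the actual endpoint and labels):

```yaml
# Hypothetical Prometheus scrape job for the per-node flow-monitoring agent.
scrape_configs:
  - job_name: network-flow-agent
    kubernetes_sd_configs:
      - role: pod            # discover agent pods via the Kubernetes API
    relabel_configs:
      # Keep only pods belonging to the agent DaemonSet (label assumed).
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: aws-network-flow-agent
        action: keep
```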

For teams already standardized on their own observability stack, the system metrics integrate directly. If you're running Amazon Managed Service for Prometheus and Amazon Managed Grafana, you can scrape these metrics into your existing dashboards and set thresholds for bandwidth, packet rate, and connection tracking that notify you before issues escalate. The session demonstrated Grafana dashboards surfacing ingress and egress bandwidth by pod, ENI-level metrics, and per-node bandwidth allowance breaches, all in one view without additional instrumentation in the application itself.

For the tail end of an investigation, where you need to identify the specific source of a problem after an alert has fired, the EKS console now provides two complementary views. The service map visualizes east-west traffic within the cluster and shows pod-to-pod flows at the replica level, including bidirectional flow details and the volume of data transferred on each path. The flow table provides a filterable tabular view of flows organized across three perspectives: cluster traffic (east-west), AWS service traffic (pod to S3 or DynamoDB at launch), and external traffic (pod to destinations outside AWS).

A particularly useful capability for cost optimization is the ability to see cross-availability-zone traffic. If you've configured topology-aware routing or locality-based routing through a service mesh like Istio, you can now verify that the optimization is producing the expected traffic patterns rather than relying on indirect signals.
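Topology-aware routing itself is enabled per Service; the cross-AZ view then confirms whether the hints are actually keeping traffic zone-local. A minimal sketch using the standard Kubernetes annotation (the service name is illustrative):

```yaml
# Service with topology-aware routing enabled: endpoints receive zone
# hints so kube-proxy prefers same-zone backends where capacity allows.
apiVersion: v1
kind: Service
metadata:
  name: checkout           # hypothetical service name
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080
```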

The session also highlighted how this feature changes the calculus around service mesh adoption. Container Network Observability surfaces sufficient network data for workloads regardless of whether they're running inside the mesh. Teams that adopted a service mesh primarily for observability can now get that coverage natively. Service meshes remain valuable for mutual TLS (mTLS), fine-grained traffic control, and advanced load balancing, but you no longer need to operate one solely for network visibility.

Shifting left with observed traffic patterns

The workflow the session proposed puts these capabilities together into a practical process. In development, you run with default-allow policies and use the service map and flow table to observe the actual communication patterns of your application. Once you understand who talks to whom, including paths you didn't design explicitly (DNS resolution, health checks, and sidecar communication), you write network policies that reflect those patterns accurately. You then apply default-deny, and because the policies were written from observed reality rather than guessed from architecture diagrams, the application continues to work.
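The policies written at this stage encode the observed paths, including the implicit ones. A sketch for a hypothetical `frontend` pod whose flow table shows traffic to a `catalog` backend on port 8080 plus cluster DNS (all names and ports are illustrative):

```yaml
# Allow policy derived from observed flows: egress to the catalog
# backend and to cluster DNS, nothing else.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-egress
  namespace: ecommerce     # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: catalog
      ports:
        - protocol: TCP
          port: 8080
    # DNS: an easily missed path that shows up in the flow data
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

With paths like these covered explicitly, applying the default-deny policy no longer takes the application offline.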

This approach also makes the process a shared responsibility between platform and application teams. Application teams know where their services send traffic. Platform teams can see what traffic arrives from the cluster-wide view. The service map bridges both perspectives without requiring either team to fully understand the other's domain.

The session referenced a CNCF study showing that addressing a security issue in development costs approximately 640 times less than addressing the same issue in production. Starting with observed traffic patterns in a dev or staging environment and validating network policies before promoting to production is a direct application of that principle. Platform teams get the confidence to enforce strict access controls. Application teams get predictable, unblocked deployments because the policies are validated before they reach production.

Key takeaways

EKS Auto Mode, native network policies, and Container Network Observability address different parts of the same operational problem. Auto Mode removes the overhead of managing core networking components and makes it operationally feasible to adopt network policies at scale. Native network policies give you the access controls needed to meet security requirements at the Kubernetes layer. Container Network Observability gives you the data to write those policies accurately, detect anomalies quickly, and verify that cost optimization techniques are working as expected.

Together, these capabilities support a shift-left security model where teams catch and address network access issues in development rather than in production. As your EKS environment grows in the number of services and the complexity of its dependencies, having this level of visibility at both the Kubernetes and VPC network layers becomes a practical requirement for operating with confidence.


Watch the full session recording: AWS re:Invent 2025 - Networking and observability strategies for Kubernetes (CNS417)