Guidance on optimizing and reducing cross-AZ data transfer charges in EKS clusters
EKS Architecture
When customers set up EKS clusters on AWS using any of their preferred tools, the best practice is to deploy the cluster's distributed components and compute infrastructure across multiple Availability Zones for optimal performance, resiliency, and consistent uptime (Amazon EKS Architecture). Technical Account Managers (TAMs) and Solutions Architects (SAs) emphasize the need for resiliency, and customers have also come to understand the importance of building resilient architectures. However, resilience has a cost: it is equally important to help customers proactively and continually identify the areas of their infrastructure where resilience is driving significant data transfer charges. Surfacing this metric early helps customers avoid billing surprises.
The sample EKS cluster below is deployed to a VPC in a Region with three Availability Zones.
[Figure: A standard EKS cluster infrastructure with 4 nodes in a managed node group spread across 3 Availability Zones.]
Sample EKS Workload
In the simple EKS cluster below, pods are scheduled onto nodes based on each node's available resources and capacity. Because the nodes are spread across multiple AZs, communication between pods on different nodes can result in cross-AZ data transfer; a minimal manifest sketch follows the list below.
- Server is deployed to 2 nodes (192.168.74.137 and 192.168.6.195) in 2 different AZs for resiliency.
- Client is deployed to node (192.168.6.195) in a single AZ.
- Client sends traffic to both Servers.
- Data sent from the Client to the Server on the same node (192.168.6.195) stays within the AZ and does not generate inter-AZ traffic or cost.
- Data sent from Client (on 192.168.6.195) to Server (on 192.168.74.137) or vice versa generates inter-AZ traffic and cost.
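To make these flows concrete, below is a minimal manifest sketch of this workload. The names (`server`, `client`), the images, and the `server` Service are hypothetical illustrations, not the exact deployment pictured; the `topologySpreadConstraints` block is what spreads the Server replicas across AZs for resiliency.

```yaml
# Minimal sketch of the sample workload (hypothetical names and placeholder images).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      # Spread the two replicas across AZs so a zonal failure leaves one running.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: server
      containers:
        - name: server
          image: public.ecr.aws/nginx/nginx:latest  # placeholder image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: server
spec:
  selector:
    app: server
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: client
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
        - name: client
          image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
          # Hit the server Service in a loop; kube-proxy balances across both
          # replicas, so roughly half of these requests cross an AZ boundary.
          command: ["sh", "-c", "while true; do wget -q -O /dev/null http://server; sleep 1; done"]
```

Because default Service routing is zone-unaware, roughly half of the Client's requests land on the replica in the other AZ, and that is exactly the traffic that shows up as inter-AZ data transfer.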
Inter-AZ Data Transfer and Cost
Data transfer between Availability Zones in the same Region is charged at $0.01/GB in each direction. For very chatty applications with pods spread across multiple Availability Zones, the volume of data transferred can be very high and become a major cost driver for customers. Identifying this early in the deployment and monitoring it continually can help customers avoid this cost.
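As a rough illustration with a hypothetical volume: a pair of services exchanging 5 TB (5,120 GB) per month across AZs is billed on both sides of the transfer, 5,120 GB × $0.01 out plus 5,120 GB × $0.01 in, or about $102 per month for that single flow.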
Monitoring Inter-AZ Data Transfer
One resource that customers can use to proactively and continuously monitor inter-AZ data transfer cost is amazon-eks-inter-az-traffic-visibility. The code is available in the AWS Samples GitHub repo. With this solution deployed for an EKS cluster, customers can closely monitor data transfer metrics for the applications and pods running in the cluster. The solution uses Amazon Athena, Amazon S3, VPC Flow Logs, AWS Lambda, and AWS Step Functions to automatically and periodically collect traffic data between pods; the data can be queried with Athena and visualized with Amazon QuickSight.
[Figures: distribution of pods across nodes and AZs; number of bytes, source IP, destination IP, flow direction, and AZs; application/pod-specific traffic and bytes transferred.]
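The solution provisions its own collection pipeline end to end, so nothing below is required to use it. Purely to illustrate the mechanism it builds on, a CloudFormation sketch of a VPC Flow Log with a custom format capturing the fields above might look like the following (the VPC ID and bucket name are hypothetical):

```yaml
# Illustrative sketch only: the aws-samples solution creates its own resources.
Resources:
  EksInterAzFlowLog:
    Type: AWS::EC2::FlowLog
    Properties:
      ResourceType: VPC
      ResourceId: vpc-0123456789abcdef0                     # hypothetical EKS cluster VPC
      TrafficType: ALL
      LogDestinationType: s3
      LogDestination: arn:aws:s3:::example-eks-flow-logs    # hypothetical bucket
      MaxAggregationInterval: 60
      # Custom format with the fields needed to attribute traffic: addresses,
      # byte counts, flow direction, and the AZ ID of the recording interface.
      LogFormat: '${srcaddr} ${dstaddr} ${bytes} ${flow-direction} ${az-id} ${start} ${end}'
```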
What About Cost Explorer?
Cost Explorer can break down cost by usage type and by tags (when enabled in Cost Allocation Tags); however, it cannot show which pods are driving the cost. Examples of the data available through Cost Explorer are shown below.
[Figures: Cost Explorer showing data transfer and cost by Usage Type; Cost Explorer showing data transfer and cost by cluster-name.]
Optimizing Inter-AZ Data Transfer
There is no one-size-fits-all answer for Kubernetes or EKS workloads, so no single solution can be prescribed. With the data above, however, customers can build QuickSight dashboards like the examples shown earlier to visualize and continuously monitor data transfer trends. That visibility surfaces the chatty applications driving high data transfer charges and helps define a more optimized approach for EKS workloads. Optimization can be achieved through several methods:
- One common method, which can be implemented immediately, is to use Kubernetes inter-pod affinity (supplemented by node affinity, or by taints and tolerations where dedicated nodes are warranted) to co-locate chatty applications on the same nodes or in the same AZ, directly reducing inter-AZ traffic; see the first sketch after this list.
- Another option is to deploy a central, shared application as a DaemonSet, so that a copy of its pod is available on the same node as any application pods that need to access it; see the second sketch after this list. This does not eliminate cross-AZ data transfer, but it can help reduce it.
- A long-term approach is to redesign the applications themselves to reduce chatter. This requires extensive architectural review and decision-making, but it positions the customer to make better choices when building or adopting applications. The Cost Optimization Pillar of the AWS Well-Architected Framework helps frame the decisions made while building workloads on AWS.
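As a sketch of the first option, reusing the hypothetical `client`/`server` labels from the earlier example: required inter-pod affinity with a zone topology key schedules each Client pod into the same AZ as a Server pod (swap in `kubernetes.io/hostname` to demand the same node).

```yaml
# Minimal sketch (hypothetical labels): keep "client" pods in the same AZ
# as a "server" pod, so their traffic stays within one AZ.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: client
  template:
    metadata:
      labels:
        app: client
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: server
              # Zone granularity: any node in the same AZ as a server pod.
              # Use kubernetes.io/hostname to demand the same node instead.
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: client
          image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
          command: ["sleep", "infinity"]
```

Note that required affinity trades resiliency for cost: if no node in a matching zone has capacity, the pod stays Pending. `preferredDuringSchedulingIgnoredDuringExecution` is the softer variant.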
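And as a sketch of the second option, assuming a hypothetical shared `cache` dependency: running it as a DaemonSet places a copy on every node, and a Service with `internalTrafficPolicy: Local` routes each caller to the copy on its own node. Traffic such as replication between the copies can still cross AZs, which is why this reduces rather than eliminates the charge.

```yaml
# Minimal sketch (hypothetical names): a shared dependency on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cache
spec:
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
        - name: cache
          image: public.ecr.aws/docker/library/redis:latest  # placeholder image
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: cache
spec:
  selector:
    app: cache
  ports:
    - port: 6379
  # Route each pod's requests only to the cache pod on its own node.
  internalTrafficPolicy: Local
```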
Notes
Not all inter-AZ data transfer can, or should, be eliminated, especially for customers that want to achieve high resiliency. This discussion should therefore be approached with the necessary caveats and with an understanding of the customer's priorities.