Optimizing your Amazon EKS compute costs with Karpenter

12 minute read
Content level: Intermediate

This article demonstrates how Vorwerk, one of AWS’s Enterprise Support customers, used Karpenter to successfully optimize their Amazon Elastic Kubernetes Service (Amazon EKS) compute costs. The article also highlights how AWS Enterprise Support helps organizations as they implement and integrate new AWS services and features into their infrastructure.

Introduction

AWS Technical Account Managers (TAMs) work closely with your team to provide an ongoing series of engagements that are customized to your needs throughout your cloud journey. As part of this journey, TAMs provide tailored technical solutions to help you optimize your workloads and reduce costs, leading to sustainable cost savings.

Vorwerk is a German family-owned company that was founded in 1883. Vorwerk produces household appliances, such as the iconic Kobold vacuum cleaner and Thermomix kitchen appliance. Mindcurv, part of Accenture Song, is a full-service digital enabler, supporting Vorwerk in its digital transformation journey, particularly when Vorwerk introduced its revolutionary connected (IoT) kitchen appliances, the Thermomix line.

Cookidoo is Vorwerk's digital platform and recipe ecosystem that's designed to complement their Thermomix kitchen appliance. This platform provides access to thousands of guided cooking recipes, meal plans, and cooking tips for over five million users. The platform has operated on Amazon EKS since 2020. It manages eight million active IoT appliances across four AWS Regions, including China, with multiple production, staging, and sandbox environments. The platform contains over 65 microservices, leverages PostgreSQL, Redis, and NoSQL databases, and uses hundreds of domains.

Vorwerk's TAM provided expert guidance as Vorwerk adopted Karpenter in their Amazon EKS clusters, which helped the company achieve substantial cost savings.

Challenges

Vorwerk operates a diverse array of workloads that span multiple programming languages, each with unique resource demands and startup characteristics. These workloads must scale rapidly to accommodate the millions of IoT appliances that they serve globally, especially during peak cooking times when traffic surges across different Regions.

Vorwerk wanted to lower their costs sustainably by looking for opportunities to remove underutilized nodes, replace existing nodes with more cost-effective alternatives, and consolidate workloads onto more efficient compute resources. As part of Enterprise Support, TAMs perform AWS Trusted Advisor reviews to identify and recommend areas of cost savings. During a review with Vorwerk, their TAM identified that a significant portion of instances in their Amazon EKS cluster's compute resources were underutilized.

As application loads increased, the Vorwerk runtime team encountered difficulties with cost optimization and availability. Managing large, shared clusters that must cater to diverse application requirements adds complexity, and it can be challenging to maintain optimal resource utilization and high availability.

Before Karpenter, Vorwerk planned to use the Cluster Autoscaler solution. However, for this solution to work without any undesired behavior, all the Amazon Elastic Compute Cloud (Amazon EC2) instance types within a node group must have similar CPU, memory, and GPU specifications. This is because the Cluster Autoscaler uses the first instance type that's specified in the node group policy to simulate Pod scheduling. If the policy includes additional instance types with higher specifications, then the Cluster Autoscaler will only schedule Pods based on the first instance type. As a result, node resources might be underutilized after scaling out. Also, if the policy includes additional instance types with lower specifications, then Pods might fail to schedule on those nodes because of resource constraints.
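For illustration, the following is a minimal, hypothetical eksctl node group sketch that satisfies this constraint. Every instance type in the list offers the same 4 vCPU and 16 GiB footprint, so the Cluster Autoscaler's scheduling simulation stays accurate. The cluster name, Region, and sizes are assumptions for the example:

# Hypothetical node group where all instance types share the same
# 4 vCPU / 16 GiB specification, as the Cluster Autoscaler expects
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster    # assumed name
  region: eu-central-1     # assumed Region
managedNodeGroups:
  - name: mixed-similar-specs
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m5n.xlarge"]
    minSize: 2
    maxSize: 20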

Karpenter offers several advantages over Cluster Autoscaler. Karpenter provides rapid instance turnaround by launching right-sized compute resources in response to changing application loads in less than a minute. With Karpenter, you don't have to create dozens of node groups to achieve the flexibility and diversity that you need. Karpenter can also create diverse node configurations by instance type with flexible NodePool options. You can manage diverse workload capacity with a single, flexible NodePool through Karpenter instead of managing numerous specific custom node groups. Because of these benefits, Vorwerk chose to implement Karpenter with several instance families and generations to reduce the dependency on specific Amazon EC2 family types. This solution supports efficient and flexible compute for Vorwerk's applications.

Important: Karpenter is an open-source software. You're responsible for installing, configuring, and managing this software in your Kubernetes clusters. AWS provides technical support when Karpenter is run unmodified using a compatible version in Amazon EKS clusters. Similar to any other customer managed software, you must maintain the availability and security of the Karpenter controller. Also, you must complete appropriate testing procedures when you upgrade it or the Kubernetes cluster where it's running. There isn't an AWS service level agreement (SLA) for Karpenter. You're responsible for making sure that the EC2 instances that are launched by Karpenter meet your business requirements.

Solution overview

As an early adopter, Vorwerk implemented Karpenter version v0.4.2. The following code sample shows a Karpenter NodePool configuration that customizes various Karpenter settings. The NodePool provisions On-Demand EC2 instances from a wide selection of instance types (compute optimized, general purpose, and memory optimized) with 16 or 32 vCPUs, the Nitro hypervisor, and an instance generation newer than the fifth. The disruption budget allows at most one node per hour to be disrupted for maintenance, and nodes automatically expire after approximately 7 days (170 hours). The kubelet is configured with tight eviction thresholds to make sure that Pods are evicted before the node runs out of memory.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
  labels:
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: karpenter-nodepools-raw
    app.kubernetes.io/version: 1.0.0
    helm.sh/chart: raw-2.0.0
  name: default
spec:
  disruption:
    budgets:
      # Allow at most 1 node to be disrupted during the first 10 minutes of every hour
      - duration: 10m
        nodes: '1'
        schedule: 0 * * * *
      # Block all disruptions for the remaining 50 minutes of every hour
      - duration: 50m
        nodes: '0'
        schedule: 10 * * * *
    consolidationPolicy: WhenUnderutilized
    expireAfter: 170h
  limits:
    cpu: 640
  template:
    metadata: {}
    spec:
      kubelet:
        evictionHard:
          memory.available: 2%
        evictionSoft:
          memory.available: 3%
        evictionSoftGracePeriod:
          memory.available: 1m0s
        systemReserved:
          memory: 2Gi
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - c
            - m
            - r
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '16'
            - '32'
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values:
            - nitro
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '5'
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
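
The NodePool's nodeClassRef points to an EC2NodeClass named default, which isn't included in the sample above. The following is a minimal sketch of what such an EC2NodeClass can look like under the same karpenter.k8s.aws/v1beta1 API. The IAM role name and discovery tags are assumptions for the example, not Vorwerk's actual values:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-example-cluster    # assumed role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster    # assumed discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster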

The following graph from Grafana illustrates the CPU usage pattern of the ingress-nginx component in one of the production Amazon EKS clusters. It shows a recurring pattern where CPU utilization spikes significantly during peak hours across multiple countries. These peak hours are commonly known as cooking hours because they're the times when customers actively use their Thermomix appliances for meal planning, ingredient shopping, or cooking. After cooking hours, the CPU usage gradually decreases. Karpenter dynamically adjusts the cluster's compute capacity to match the fluctuating demand. This optimizes resource utilization and minimizes unnecessary overhead during off-peak periods.

[Image: Grafana graph of ingress-nginx CPU usage in a production Amazon EKS cluster, showing recurring spikes during cooking hours]

With Karpenter, the organization reduced both the management overhead and the number of nodes in their Amazon EKS clusters:

  • The number of nodes in the production EU cluster was reduced from 77 to 36.
  • The number of nodes in the production North America cluster was reduced from 55 to 22.
  • The number of nodes in the production Oceania cluster was reduced from 52 to 21.

Vorwerk implemented Karpenter across all their Kubernetes clusters to meet their cost optimization objectives. Since this adoption, Vorwerk has experienced a substantial decrease in their overall Amazon EC2 spending. The following graph, where the y-axis represents US Dollars, shows the cost impact of this implementation.

[Image: Graph of Amazon EC2 cost in US Dollars, showing the decrease after the Karpenter implementation]

Vorwerk overcame several initial challenges during the Karpenter implementation. This section summarizes the issues that Vorwerk encountered and how they were resolved.

Customizing disk volume size

Issue

At the time, Karpenter didn't allow customization of instance disk volume sizes, and the default value was too low for Vorwerk's needs.

Workaround

Vorwerk used Karpenter's provisioner capability to specify a launch template. Then, Vorwerk used Terraform to create a custom launch template with the desired disk volume configuration to match their requirements.

Resolution

To address Vorwerk's requirements, Karpenter implemented the addition of the BlockDeviceMappings feature in Karpenter v0.7.0. This feature allows configuration of volume size, type, IOPS, throughput, encryption, encryption key, and the deleteOnTermination flag.

resource "aws_launch_template" "karpenter_launch_template" { name = "Karpenter-${data.aws_eks_cluster.eks_cluster.name}-template-${var.provisioner_name}"

block_device_mappings { device_name = "/dev/xvda" ebs { volume_size = var.ebs_volume_size volume_type = "gp3" } } ... }

The provisioner then referenced the custom launch template as follows:

provider:
  launchTemplate: Karpenter-nonprod-eks-eu-template-default
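
With the native feature, the volume configuration moves into Karpenter itself. The following is a minimal sketch of blockDeviceMappings in the karpenter.k8s.aws/v1beta1 EC2NodeClass API. The volume size and performance parameters are illustrative values, not Vorwerk's actual configuration:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi          # illustrative size
        volumeType: gp3
        iops: 3000                 # illustrative performance settings
        throughput: 125
        encrypted: true
        deleteOnTermination: true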

Accounting for DaemonSets

Issue

Vorwerk faced resource contention issues when the default provisioner created a new node and a DaemonSet was scheduled to run on it. Before the DaemonSet Pod was assigned, Karpenter had already assigned the existing pending Pods to the new node, occupying the node's full capacity. As a result, the DaemonSet Pod for that node remained in a perpetual Pending state. At the time, Karpenter didn't consider pending DaemonSet Pods as a cause for creating new nodes, which led to a potential resource contention scenario.

Workaround

The initial solution involved assigning a higher priority to DaemonSets to make sure that the DaemonSet Pods were running on the nodes. Any Pod that was evicted to accommodate the high-priority DaemonSet Pods transitioned to a Pending state. Karpenter provisioned additional nodes until all Pods were in the Running state, resulting in the creation of more instances.
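
A minimal sketch of this kind of workaround uses a standard Kubernetes PriorityClass. The name and priority value here are illustrative, not Vorwerk's actual configuration:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical    # assumed name
value: 1000000                # higher than the priority of regular workloads
globalDefault: false
description: Makes sure that DaemonSet Pods preempt regular Pods on new nodes.

Each DaemonSet then references this PriorityClass through priorityClassName: daemonset-critical in its Pod template.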

Resolution

In version 0.5.0, the Karpenter team fixed a bug where DaemonSets weren't correctly included in binpacking calculations.

Stability issues

Issue

Vorwerk faced stability issues when they activated the consolidation feature in Karpenter. Certain applications showed instability, with Pods undergoing frequent recreation cycles.

After investigating the node status, the team observed anomalous patterns of node deletion and creation, with some nodes having lifespans shorter than 1 minute. Further analysis revealed that the consolidation process occurred at an unnecessarily high frequency. This resulted in application thrashing.

Resolution

To address this issue, the Vorwerk runtime team scheduled consolidation activities exclusively during off-peak hours. To enhance monitoring capabilities and gain deeper insights into cluster behavior, the runtime team used Grafana to develop a custom Karpenter dashboard.
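
A minimal sketch of how such an off-peak schedule can be expressed with NodePool disruption budgets follows. The peak window of 17:00 to 21:00 UTC is an assumption for the example, not Vorwerk's actual schedule:

spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    budgets:
      # Block all disruptions during the assumed peak cooking hours
      - schedule: '0 17 * * *'
        duration: 4h
        nodes: '0'
      # Otherwise, allow up to 10% of nodes to be disrupted at a time
      - nodes: '10%'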

Engagement with TAMs

As an AWS Enterprise Support customer, Vorwerk implemented Karpenter and optimized their Amazon EKS environment with technical guidance and recommendations from their TAM. The TAM identified opportunities for cost optimization improvements and addressed emerging issues during the implementation phase. Also, the TAM engaged subject matter experts, including a specialist TAM (STAM) and Amazon EKS and Karpenter service team members, to provide additional guidance. After checking the NodePool configuration, the STAM noticed that the node budget allowed disruption of only one node at the beginning of every hour. Discussions with the customer established that the workload could tolerate more interruptions, so the STAM suggested that Vorwerk increase the number of disruptible nodes for cost optimization purposes. Quarterly Business Reviews, along with regular check-ins and cadence calls, helped the Vorwerk team control Amazon EKS compute spending and regularly monitor progress towards their desired outcomes.

Structured engagement with TAMs fosters transparency and provides insight into recent bug fixes and new features that are relevant to Vorwerk's operations. That way, Vorwerk can use the latest advancements.

Conclusion

After adopting Karpenter, Vorwerk experienced substantial benefits. The organization reduced the overhead that's associated with managing their Amazon EKS clusters and delivered long-term cost savings on EC2 instances. With Karpenter's capabilities, Vorwerk achieved a remarkable 60% decrease in compute usage across all environments. Also, Karpenter alleviated the operational burden that Vorwerk previously faced in managing node groups. Because of these efforts, Vorwerk's scale-up operations became more streamlined. Recently, Vorwerk conducted an evaluation and embraced Karpenter consolidation with disruption budgets to further optimize their infrastructure. Vorwerk consciously overprovisions their infrastructure to address performance and resiliency needs during peak holiday seasons, such as Christmas and Easter. With Karpenter in place, development teams can review their horizontal Pod autoscaler configurations and grant their applications higher maximums for replicas, memory, and CPU. This is because Karpenter dynamically manages the number of required nodes and reduces manual node management for the Vorwerk team.
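
A minimal sketch of the kind of HorizontalPodAutoscaler change this enables follows, with a raised replica ceiling. The service name and numbers are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service    # assumed service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 2
  maxReplicas: 40          # raised ceiling; Karpenter adds nodes as needed
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70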

AWS Support cloud engineers and TAMs can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support.


About the authors


Hakan Akkurt

Hakan Akkurt is a Senior TAM at AWS based in Germany, bringing over 20 years of technology experience to the role. He works closely with Enterprise Support customers and uses his expertise in operations, resiliency, consulting, architecture, and security to help customers improve their AWS Cloud environments. Hakan is also an active member of the containers community at AWS, where he shares his extensive knowledge of containers.


Sven Gerlach

Sven Gerlach leads Vorwerk Digital R&D’s Cloud Platform Engineering and Site Reliability Engineering teams. Since joining Vorwerk in 2019, he has spearheaded significant advancements in these domains. Previously, Sven excelled as a software developer, software architect, and product owner, and he combines his extensive expertise in software development, product management, and agile methodologies.


Gustavo Recio

Gustavo Recio is a member of Vorwerk’s Site Reliability Engineering team. In a previous iteration of the team topology, he was the tech lead of the infrastructure team. At Mindcurv, part of Accenture Song, Gustavo is the Practice Lead for Cloud Platform Engineering.


Antonio Bernabeu

Antonio Bernabeu is a member of the Vorwerk runtime team. He is responsible for developing, operating, maintaining, and evolving the runtime part of Vorwerk’s developer platform. At Mindcurv, part of Accenture Song, Antonio is a cloud platform engineer.
