AWS re:Invent 2024 - The future of Kubernetes on AWS
This blog post summarizes key highlights from the AWS re:Invent 2024 session "KUB201 - The future of Kubernetes on AWS" presented by Nathan Taber, Eswar Bala, and Hyungtae Kim. We'll explore AWS's vision for Kubernetes, recent innovations in Amazon EKS, and how companies like Snowflake are leveraging EKS for AI workloads.
How can we make Kubernetes work for everyone, from tech giants to small businesses? At AWS re:Invent 2024, Nathan Taber (Product Manager, AWS), Eswar Bala (Director, Amazon), and Hyungtae Kim (Principal Engineer, Snowflake) tackled this question head-on. They shared insights on the future of Kubernetes on AWS and how companies are leveraging Amazon EKS for cutting-edge applications. Let's dive into how Kubernetes has evolved and why it's become so crucial in modern cloud computing.
The Evolution of Cloud Computing and Kubernetes
Nathan Taber began by highlighting the fundamental shift in how businesses use technology. Since Amazon introduced Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2) in 2006 [1][2], cloud computing has revolutionized how we store, process, and retrieve information. We can now deploy applications globally in minutes and instantiate entire data centers for complex AI models.
However, with applications running in various environments, organizations struggle to maintain a consistent operational model. Kubernetes has emerged as the leading solution, offering a simple set of APIs for managing large groups of servers and coordinating applications. For those new to the concept, Kubernetes is an open-source platform that automates the deployment, scaling, and management of containerized applications. Think of it as a smart traffic controller for your software containers, ensuring they're running where and when they should be.
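To make the "traffic controller" idea a little more concrete, here is a minimal sketch of the declarative pattern behind it: you state how many copies of each application you want, and a control loop works out what to start or stop. This is a toy illustration in plain Python, not Kubernetes code; the `reconcile` function and the app names are made up for the example.

```python
from collections import Counter


def reconcile(desired: dict[str, int], running: list[str]) -> list[tuple[str, str]]:
    """Return the actions needed to make `running` match `desired`.

    desired maps an app name to the replica count we want.
    running lists the app name of every container currently up.
    """
    actual = Counter(running)
    actions = []
    # Visit every app we either want or currently have.
    for app in sorted(set(desired) | set(actual)):
        diff = desired.get(app, 0) - actual.get(app, 0)
        if diff > 0:
            actions += [("start", app)] * diff
        elif diff < 0:
            actions += [("stop", app)] * (-diff)
    return actions


if __name__ == "__main__":
    # Hypothetical state: we want 3 web and 1 worker, but only 1 web
    # and a stray cache container are running.
    print(reconcile({"web": 3, "worker": 1}, ["web", "cache"]))
    # [('stop', 'cache'), ('start', 'web'), ('start', 'web'), ('start', 'worker')]
```

A real cluster runs this kind of comparison continuously, which is why a crashed container tends to come back without anyone intervening.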
Key advantages of Kubernetes:
- Simplicity: ~1,500 API methods across 55 core resources (compared to 10,000+ for the AWS SDK for Python)
- Versatility: Works across various environments (data centers, cloud, even F-16 fighter jets)
- Extensibility: 195+ CNCF projects and hundreds of landscape projects
With the growing adoption of Kubernetes, AWS has been hard at work enhancing Amazon EKS. Let's look at some of the exciting new features they've introduced.
Recent Innovations in Amazon EKS
AWS has been improving Amazon EKS to address the challenges of running Kubernetes at scale. Here are some key developments, grouped by category:
Upgrade and Version Management
Keeping Kubernetes clusters up-to-date is crucial for security and performance. AWS has introduced several features to make this process smoother:
- Faster Version Updates: EKS now consistently releases new versions 35-40 days after upstream Kubernetes.
- Extended Version Support: Provides an additional 12 months of full support for any Kubernetes minor version.
- Upgrade Policies: Automatically keep control planes updated on standard versions for dev/test clusters.
- Upgrade Insights: Offers a "report card" for clusters, showing potential issues when upgrading to future versions.
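For planning purposes, the two timing figures above (35-40 days after upstream, plus 12 extra months of support) can be turned into rough dates. The sketch below is only date arithmetic on those numbers; the function names and the example dates are hypothetical, and the end of standard support is something you would look up for your version rather than compute.

```python
from datetime import date, timedelta


def eks_release_window(upstream_release: date) -> tuple[date, date]:
    """Earliest and latest expected EKS release, 35 to 40 days after upstream."""
    return (
        upstream_release + timedelta(days=35),
        upstream_release + timedelta(days=40),
    )


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so it stays valid."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    return date(year, month_index + 1, min(d.day, 28))


def extended_support_end(standard_support_end: date) -> date:
    """Extended support adds 12 months after standard support ends."""
    return add_months(standard_support_end, 12)


if __name__ == "__main__":
    # Hypothetical dates, not taken from any real release calendar.
    print(eks_release_window(date(2025, 1, 1)))
    print(extended_support_end(date(2026, 3, 15)))
```

Clamping the day to 28 is a deliberate simplification; it keeps the helper dependency-free at the cost of being a few days early for end-of-month dates.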
Observability and Cost Management
Understanding what's happening in your cluster and managing costs are key challenges in Kubernetes. Amazon EKS now offers enhanced tools for monitoring and financial control:
- Enhanced Control Plane Observability: New metrics and pre-configured dashboards for easier cluster monitoring.
- Container Insights: Improved integration with CloudWatch for deeper performance analysis.
- Network Flow Monitor: New feature for troubleshooting network performance issues.
- Split Cost Allocation Data: Native cost reporting for Kubernetes resources, breaking down expenses at various levels (pod, deployment, namespace, etc.).
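To get a feel for what breaking costs down by pod or namespace means, here is a deliberately simplified sketch that divides one node's cost across namespaces in proportion to each pod's CPU request. This is an assumption made for illustration, not the formula AWS uses (which also has to account for memory and unused capacity); the function name, field names, and figures are all hypothetical.

```python
from collections import defaultdict


def split_node_cost(node_cost: float, pods: list[dict]) -> dict[str, float]:
    """Split one node's cost across namespaces by share of requested CPU.

    Each pod is a dict with "namespace" and "cpu_request" (in vCPUs).
    """
    total_cpu = sum(p["cpu_request"] for p in pods)
    if total_cpu == 0:
        return {}  # nothing requested, so nothing to attribute
    by_namespace: dict[str, float] = defaultdict(float)
    for p in pods:
        by_namespace[p["namespace"]] += node_cost * p["cpu_request"] / total_cpu
    return dict(by_namespace)


if __name__ == "__main__":
    # Hypothetical node costing 10.0 (any currency) for the period.
    pods = [
        {"namespace": "checkout", "cpu_request": 2.0},
        {"namespace": "checkout", "cpu_request": 1.0},
        {"namespace": "analytics", "cpu_request": 1.0},
    ]
    print(split_node_cost(10.0, pods))  # {'checkout': 7.5, 'analytics': 2.5}
```

The same aggregation could be keyed by deployment or pod name instead of namespace to get the other levels of the breakdown.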
Add-ons and Integration
To extend Amazon EKS functionality and integrate with other AWS services, these new features have been introduced:
- Expanded Add-ons Catalog: New first-party add-ons and over 40 marketplace add-ons for easy integration of tools like Datadog, Kubecost, and Splunk.
- AWS Controllers for Kubernetes (ACK): Allows management of AWS resources directly from the Kubernetes API.
- Kube Resource Orchestrator (kro): New tool for abstracting and combining Kubernetes resources.
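As a rough picture of what managing AWS resources from the Kubernetes API looks like with ACK, the sketch below builds a manifest for an S3 bucket as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. It assumes the ACK S3 controller is installed and that its `Bucket` resource lives in the `s3.services.k8s.aws/v1alpha1` API group; check the CRDs in your own cluster before relying on those values. The helper name and bucket name are made up.

```python
import json


def ack_bucket_manifest(name: str, namespace: str = "default") -> dict:
    """Build a Kubernetes manifest describing an S3 bucket for ACK to create."""
    return {
        # Assumed API group/version for the ACK S3 controller's Bucket CRD.
        "apiVersion": "s3.services.k8s.aws/v1alpha1",
        "kind": "Bucket",
        "metadata": {"name": name, "namespace": namespace},
        # spec.name is the bucket name to create in Amazon S3.
        "spec": {"name": name},
    }


if __name__ == "__main__":
    # Hypothetical bucket name; S3 bucket names must be globally unique.
    print(json.dumps(ack_bucket_manifest("example-reports-bucket"), indent=2))
```

The point is less the bucket itself than the workflow: the AWS resource is declared alongside the application's other Kubernetes objects and reconciled the same way.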
Networking and Infrastructure
Networking is a critical aspect of Kubernetes deployments. Amazon EKS has made significant improvements in this area:
- IPv6 Support: Complete IPv6 support across all aspects of EKS clusters.
- Application Recovery Controller Integration: Enables automatic traffic shifting between AZs during outages.
- EKS Hybrid Nodes: Connect on-premises or edge compute resources to EKS control planes in the cloud.
Simplification and Automation
To make Amazon EKS easier to use and manage, AWS has introduced these automation and simplification features:
- EKS Auto Mode: Simplifies cluster creation with pre-configured, production-ready setups.
- Node Health and Auto Repair: Automatically monitors and repairs node health, especially for GPU instances.
While these improvements benefit all EKS users, AWS has paid special attention to a rapidly growing use case: machine learning workloads.
Machine Learning on EKS
Amazon EKS has been widely adopted for machine learning workloads. AWS has invested in features to support this use case:
- Integration with EC2 Ultra Servers: This allows for high-performance GPU computing, crucial for training large ML models.
- Mountpoint for Amazon S3 CSI driver: Enables efficient access to large datasets stored in Amazon S3, a common requirement for ML workloads.
- EFA Kubernetes device plugins: Improve GPU networking performance, essential for distributed training jobs.
- Native support for frameworks like Ray: Simplifies the deployment of popular ML tools and libraries.
To see how these innovations translate into real-world impact, let's explore how Snowflake leverages Amazon EKS for their cutting-edge AI platform.
Real-World Example: Snowflake's AI Infrastructure
Hyungtae Kim from Snowflake shared their experience using Amazon EKS to power their Cortex AI platform. Cortex AI is Snowflake's suite of pre-built AI and ML models that helps businesses quickly deploy AI solutions without extensive data science expertise. They faced two major challenges: managing scarce GPU resources and dealing with the fragility of distributed AI workloads.
To address these, Snowflake developed several innovative solutions. They created a custom capacity controller for intelligent resource allocation, implemented a node health service for proactive diagnostics, designed a pod janitor for clean workload termination, and deployed an Invariant Enforcer to maintain cluster properties. By building on Amazon EKS, Snowflake benefited from optimized performance for distributed training, simplified storage management using various AWS services, and automatic node remediation and auto-scaling.
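The session recap doesn't describe how Snowflake's pod janitor is implemented, so the following is only a hypothetical sketch of the general idea: in a distributed job, one failed worker usually makes the rest useless, so the remaining pods should be cleaned up to free their GPUs. The `Pod` type, its fields, and the selection rule are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    job: str    # distributed job this pod belongs to
    phase: str  # e.g. "Running", "Succeeded", "Failed"


def pods_to_clean_up(pods: list[Pod]) -> list[str]:
    """Names of still-running pods whose job already has a failed member."""
    failed_jobs = {p.job for p in pods if p.phase == "Failed"}
    return [
        p.name for p in pods
        if p.job in failed_jobs and p.phase == "Running"
    ]


if __name__ == "__main__":
    # Hypothetical cluster snapshot: job "a" has lost a worker, job "b" is healthy.
    snapshot = [
        Pod("train-a-0", "a", "Running"),
        Pod("train-a-1", "a", "Failed"),
        Pod("train-b-0", "b", "Running"),
    ]
    print(pods_to_clean_up(snapshot))  # ['train-a-0']
```

A production version would read pod state from the Kubernetes API and delete the selected pods; that part is left out here to keep the sketch self-contained.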
Throughout their journey, Snowflake learned valuable lessons: they embraced impermanence in workload design, became strategic about hardware management, prioritized automation, and learned to plan network infrastructure generously. These insights have helped Snowflake build a robust, scalable AI infrastructure on top of Amazon EKS.
With all these advancements, you might wonder: what's next for Amazon EKS? Nathan Taber shared AWS's vision for the future.
The Future of Amazon EKS
Nathan Taber outlined AWS's vision for the future of EKS:
- Optimize for critical workload patterns at any scale
- Deepen AWS integrations and management tools
- Simplify Kubernetes-based platform building on AWS
- Accelerate open-source innovation
AWS aims to make Kubernetes "disappear" by simplifying its use and management, allowing companies to focus on their core business rather than becoming Kubernetes experts.
I'm already looking forward to it! As we wrap up, let's reflect on what all this means for Kubernetes users and the broader cloud computing landscape.
Conclusion
As Kubernetes marks its 10th year, AWS continues to evolve Amazon EKS to meet the changing needs of diverse organizations. Nathan Taber emphasized that the future of EKS will be shaped by ongoing customer feedback and real-world use cases. The goal is to simplify Kubernetes management while preserving its power and flexibility, allowing companies to focus on their core business objectives.
AWS encourages users to engage with their public containers roadmap, where they can submit ideas, vote on features, and participate in discussions about the future of EKS. For those looking to dive deeper into Amazon EKS, resources such as the EKS Best Practices Guide, EKS Workshop, and EKS Blueprints (Terraform / AWS CDK) are available.
For those interested in watching the full session, including detailed explanations and demonstrations from Nathan Taber, Hyungtae Kim, and Eswar Bala, the recording is available on the AWS YouTube channel.