AWS re:Invent 2024 - Amazon EKS as data platform for analytics

This blog post summarizes the AWS re:Invent 2024 session "Amazon EKS as data platform for analytics" presented by Roland Barcia, Christina Andonov, and Victor Gershkovich. We'll explore how to transform Amazon EKS into a high-performance analytics platform, looking at optimization techniques, best practices, and AppsFlyer's real-world implementation.

Is your organization running both web applications and data analytics workloads? Are you struggling to optimize your Kubernetes platform for both? At AWS re:Invent 2024, Roland Barcia (Director, Specialist Technology Team, AWS), Christina Andonov (Senior Specialist Solutions Architect, AWS), and Victor Gershkovich (R&D Group Leader, Data Platform, AppsFlyer) delivered an insightful session on transforming Amazon EKS into a high-performance data analytics platform. This blog post summarizes their key insights on building scalable, cost-efficient analytics environments on Kubernetes.

The Evolution of Platform Engineering for Data Workloads

Roland Barcia opened the session by highlighting how data has become a critical commodity in the age of generative AI and machine learning. Organizations now have diverse data personas – from external users requiring real-time decisions to internal teams processing massive datasets for business insights.

The collision between traditional platform engineering and data processing presents unique challenges. While the first generation of platform engineering focused on web applications and microservices, we're now entering an era where data scientists and engineers bring new requirements:

  • Stateful applications that require specialized storage
  • Specialized hardware like GPUs for ML/AI workloads
  • Intense, bursty computing needs for data processing
  • Diverse tooling preferences among data scientists and engineers

This evolution introduces new challenges for platform teams:

  1. Resource optimization - Providing the right compute types, storage, and networking
  2. Workload isolation - Preventing data jobs from interfering with transactional systems
  3. Cost efficiency - Balancing performance with budget constraints
  4. Developer autonomy - Supporting data engineers' tool preferences while maintaining governance

As Roland explained, "These are now new concerns in this next generation" of platform engineering.

Optimizing Amazon EKS for Analytics

Christina Andonov explored the technical best practices for optimizing Amazon EKS clusters for analytics workloads. She organized her recommendations into three logical layers:

  1. Layer 1: Building a production-ready cluster optimized for analytics
  2. Layer 2: Installing purpose-specific open-source tools
  3. Layer 3: Onboarding tenants and providing self-service capabilities

Christina emphasized that analytics workloads have fundamentally different traffic patterns from business applications. While web applications tend to have predictable, consistent traffic (like "weather in California"), analytics workloads resemble "the weather in the Caribbean during hurricane season" - with massive, bursty job submissions that require rapid scaling.

Networking Optimizations

To avoid IP exhaustion without sacrificing performance, Christina walked through several networking recommendations, including evaluating IPv6.

She noted that while IPv6 is fully supported in Kubernetes and recent Spark versions (3.4+), organizations should test it thoroughly, as other components in their stack might not yet be compatible.
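For illustration, a minimal IPv6 cluster definition with eksctl might look like the following sketch (the cluster name and region are hypothetical, not from the session; IPv6 clusters require OIDC and the managed networking add-ons):

```yaml
# Minimal eksctl ClusterConfig for an IPv6-based EKS cluster.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: analytics-platform   # hypothetical cluster name
  region: us-east-1          # hypothetical region

kubernetesNetworkConfig:
  ipFamily: IPv6             # pods get IPv6 addresses; no IP exhaustion

iam:
  withOIDC: true             # required for IPv6 clusters

addons:                      # managed add-ons required for IPv6
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
```

As Christina cautioned, this should be validated end to end: sidecars, admission webhooks, or older client libraries in the stack may still assume IPv4.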

DNS and Service Discovery Optimizations

For reliable service discovery during rapid scaling, Christina highlighted several CoreDNS optimizations.

As Christina humorously pointed out, "Do you know why CoreDNS gets invited to all the parties? Because it resolves everything... well, most of the time."

Compute Scaling Optimizations

For improved scaling performance with analytics workloads, Christina recommended:

  • Replacing Cluster Autoscaler with Karpenter for faster scaling and cost optimization
  • Configuring separate NodePools for Spark driver pods (on-demand) and executors (spot)
  • Using instance capacity awareness to avoid interrupted jobs

She emphasized that Karpenter can bring up instances in under a minute compared to 2-3 minutes with Cluster Autoscaler, making it ideal for the bursty nature of analytics workloads.
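The driver/executor split described above can be expressed as two Karpenter NodePools. A minimal sketch using the Karpenter v1 API (pool names and the EC2NodeClass reference are illustrative assumptions):

```yaml
# On-demand pool for Spark drivers, which must not be interrupted.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-drivers
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed EC2NodeClass name
---
# Spot pool for executors, which tolerate interruption.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-executors
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

Driver and executor pods can then target their pool with a nodeSelector on the karpenter.sh/nodepool label that Karpenter applies to every node it launches.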

Storage Optimizations for Analytics

For data-intensive operations like Spark shuffle, Christina provided these key recommendations:

  • Using instances with built-in SSD drives (d-type instances)
  • Configuring RAID0 for instances with multiple SSD drives via Karpenter's instanceStorePolicy: RAID0 setting
  • Implementing Amazon EBS volumes for long-running jobs that need checkpointing
  • Using both local storage and Amazon EBS depending on job requirements
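A hedged sketch of the local-SSD setup in Karpenter (the instance families, IAM role, and discovery tags are illustrative assumptions, not from the session):

```yaml
# EC2NodeClass that stripes multiple local NVMe SSDs into RAID0
# and exposes them as node ephemeral storage for Spark shuffle.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: spark-shuffle
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole        # assumed IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: analytics-platform   # assumed tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: analytics-platform
  instanceStorePolicy: RAID0     # stripe all instance-store volumes
---
# NodePool restricted to families with built-in SSD instance stores.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-shuffle
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6id", "r6id", "c6id"]   # illustrative d-type families
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spark-shuffle
```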

Container Image Performance

To achieve pod startup times under one minute, Christina shared several container image optimizations.

Comprehensive Monitoring

Christina emphasized monitoring three critical components beyond the usual metrics:

  1. Kubernetes control plane metrics
  2. AWS API throttling metrics (especially Amazon EC2 and Amazon EBS)
  3. Network metrics (particularly for UDP protocol used by CoreDNS)

Building a Purpose-Built Data Processing Platform

For Layer 2, Christina discussed the open-source tools, including processing frameworks such as Spark and Flink, that transform a vanilla Amazon EKS cluster into a specialized analytics platform.

Christina addressed the common question of whether to combine different processing frameworks (like Spark and Flink) in the same cluster. Her guidance: "If you have smaller clusters, up to 200-300 nodes, yes, you can run them on the same cluster. But if your clusters start growing beyond that point, it's good to start thinking about separating them."

Empowering Data Teams with Self-Service Capabilities

For Layer 3, Christina focused on tenant isolation strategies and self-service capabilities. While some organizations use one cluster per team, most implement multi-tenant clusters with namespace isolation.

Christina identified a key friction point: data engineers often must request AWS resources (IAM roles, Amazon S3 buckets, Amazon RDS instances) through tickets, creating bottlenecks. Her solution? Extend the Kubernetes API using AWS Controllers for Kubernetes (ACK) to provide self-service capabilities.

"ACK is an open-source project that AWS open-sourced a while back. Recently, the Amazon EKS service team took ownership of all of the controllers for ACK. Currently, we're at 50 GA controllers," Christina explained. This means organizations can provide data engineers with familiar Kubernetes interfaces to provision AWS resources securely without tickets or manual intervention.
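As a sketch of this self-service model, a data engineer could request an S3 bucket with a plain Kubernetes manifest once the ACK S3 controller is installed (the resource, namespace, and bucket names here are hypothetical):

```yaml
# ACK S3 Bucket resource: applying this manifest asks the ACK
# controller to create and manage a real Amazon S3 bucket.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: team-a-scratch
  namespace: data-team-a             # hypothetical tenant namespace
spec:
  name: my-org-data-team-a-scratch   # hypothetical bucket name; must be globally unique
```

A kubectl apply then replaces the ticket: the controller reconciles the resource against AWS and reports status back on the same Kubernetes object.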

For organizations with multiple platform teams, Christina advised standardizing not just on tools but on how they're used: "When you start splitting your platform teams, let's say you end up with 10 platform teams and you standardize on Terraform, you're gonna end up with 10 different code bases guaranteed."

Real-World Implementation: AppsFlyer's Amazon EKS Analytics Platform

Victor Gershkovich from AppsFlyer shared how their move from Amazon EC2-based Spark clusters to Amazon EKS transformed their data processing capabilities. AppsFlyer processes over 100 petabytes of data daily through thousands of different Spark jobs, with strict SLA requirements.

"One decision can change your data organization", Victor explained. "It can boost your performance, enrich your observability, save costs, and empower your developers".

Victor showcased the dramatic improvement in their cluster scaling after implementing Karpenter:

  • Before: EC2-based clusters with poor utilization and inefficient scaling
  • After: Precise scaling with 80% peak utilization and minimal idle time

In a typical 24-hour cycle, AppsFlyer's platform performs approximately 1,600 node creations and terminations for their compaction workload alone. Karpenter's intelligent instance selection led to:

  • Predominant use of Graviton instances (with fallback to older generations when needed)
  • 15% of the nodes being bare-metal instances for maximum performance
  • Automatic distribution across availability zones based on cost efficiency
  • All job pods running within the same AZ to eliminate cross-AZ data transfer costs

Resilience to Spot Interruptions

Despite experiencing dozens of spot interruptions per hour, AppsFlyer maintained their SLAs by:

  • Configuring Karpenter to hook spot termination signals to Spark
  • Setting up migration of intermediate data within a two-minute window
  • Eliminating reprocessing needs when nodes fail
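The session did not show AppsFlyer's exact configuration, but Spark's built-in decommissioning (available since Spark 3.1) is the standard mechanism for this kind of graceful handoff. A hedged fragment, assuming the open-source Spark operator and illustrative names:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: compaction-job               # hypothetical job name
spec:
  sparkConf:
    # React to the node's termination notice instead of failing mid-task.
    spark.decommission.enabled: "true"
    # Migrate shuffle and cached blocks off the terminating node within
    # the interruption window so the job avoids reprocessing.
    spark.storage.decommission.enabled: "true"
    spark.storage.decommission.shuffleBlocks.enabled: "true"
    spark.storage.decommission.rddBlocks.enabled: "true"
  driver:
    nodeSelector:
      karpenter.sh/capacity-type: on-demand   # keep the driver safe
  executor:
    nodeSelector:
      karpenter.sh/capacity-type: spot        # executors ride spot
```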

Victor shared their configuration best practices:

  • Using Amazon Linux 2023 for fast boot times (under 10 seconds)
  • Adjusting node decommission budgets based on processing trends
  • Configuring Spark for termination awareness
  • Utilizing local storage for optimal performance
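Two of these practices map directly to Karpenter fields. A minimal, hedged sketch (the role, tags, and budget values are illustrative, not AppsFlyer's):

```yaml
# Amazon Linux 2023 AMIs for fast node boot times.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: al2023
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole        # assumed IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: analytics-platform   # assumed tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: analytics-platform
---
# Disruption budget: let Karpenter voluntarily decommission at most
# 10% of this pool's nodes at a time; tune based on processing trends.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-executors
spec:
  disruption:
    budgets:
      - nodes: "10%"
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: al2023
```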

Enhanced Observability and Business Intelligence

AppsFlyer used metrics from Karpenter, Kubernetes, and Spark together, giving them valuable new insights:

  • Percentage breakdown of each data processing flow
  • Start and end times for each processing stage
  • Daily, weekly, and monthly processing trends
  • Real-time cost calculations per dataset processed

"By adding Karpenter metrics for price estimation, we can calculate how much each data processing cost. So we get the price for processing our data per minute in near real-time", Victor explained. "This is huge, both for engineering and business purposes".

These insights help AppsFlyer to optimize costs, improve customer pricing models, and make data-driven decisions about their entire processing pipeline.

Developer Autonomy and Infrastructure as Code

AppsFlyer implemented a Git-based workflow that empowers data engineers with full autonomy over their applications:

  • Structured repositories for infrastructure, applications, and third-party integrations
  • Environment segregation (development, staging, production)
  • Automated workflows and validation for executions and deployments

"This approach allows us to manage the infrastructure of Kubernetes and application components from a single interface", Victor explained. "It raises the velocity and autonomy of the data engineers and contributes to the knowledge of both [platform and data engineers]".

Business Impact

The move to Amazon EKS delivered impressive business outcomes:

  • 60% cost reduction compared to their previous Amazon EC2 Intel-based Spark clusters
  • 35% improvement in SLA performance
  • Significantly enriched observability
  • Reduced operational overhead for platform engineers

Conclusion: Building a Future-Proof Analytics Platform on Amazon EKS

This session showed how Amazon EKS can become a powerful and cost-effective platform for data analytics with the right setup. By following Christina's advice, teams can make their Kubernetes clusters work better for data processing - from network settings and scaling to storage options and monitoring.

As seen in Victor's real-world example at AppsFlyer, these improvements led to real business benefits: lower costs, better performance, and clearer insights into their data. Beyond the technical side, success also depends on how teams work together. Instead of keeping platform teams small, companies should focus on using the same tools and methods across teams to help everyone grow and work better together.

As Roland pointed out, remember that developers and data teams are using your platform. If you give them self-service tools while still maintaining proper controls, they'll actually use what you build. Setting up a data platform on Amazon EKS takes work, but the benefits in performance, cost savings, and team productivity make it worth the effort. To watch the complete session and learn more details, visit the AWS YouTube Channel.
