How does EMR on EKS support Spot Instance usage, and what happens if an executor running on a Spot node is interrupted?


Amazon EMR on EKS allows Spark workloads to run on EKS-managed infrastructure, giving users the flexibility to optimize cost by leveraging EC2 Spot Instances. Spot Instances are a great way to reduce compute costs, especially for transient, fault-tolerant jobs—but they come with the risk of sudden interruptions when AWS reclaims capacity.

In the context of EMR on EKS:

  • How does Spark on EMR handle executor placement across On-Demand and Spot nodes in a mixed-instance EKS node group?
  • What mechanisms are available (e.g., tolerations, taints, node selectors, labels) to control where Spark executors are scheduled?
  • What happens to a running Spark job when a Spot node is reclaimed? How does EMR on EKS ensure fault tolerance or job recovery in this scenario?
  • Are there strategies to gracefully handle Spot interruptions and retry logic?
  • How does Spark's internal retry mechanism interact with Kubernetes pod scheduling in the event of a lost executor?
asked 9 days ago · 24 views
3 Answers

Amazon EMR on EKS supports the use of Spot Instances for running Apache Spark and Apache Flink workloads, offering significant cost savings compared to On-Demand instances. However, it's important to understand how the system handles potential interruptions to ensure fault tolerance and job recovery.

For Apache Flink workloads, EMR on EKS has implemented graceful decommissioning of Task Managers running on Spot Instances. This feature is available in version 6.15.0 and higher. When a Spot Instance interruption notice is received, the system initiates a just-in-time (JIT) checkpointing process for streaming jobs. This allows the Job Manager to handle the interruption signal, block scheduling of additional tasks on the affected Spot Instance, and perform a checkpoint of the running job's state.

After the interruption, the Job Manager employs a combined restart mechanism. This mechanism attempts to restart the job either after reaching the target resource parallelism or at the end of a configured window. This approach helps prevent consecutive job restarts that might be caused by multiple Spot Instance terminations in a short period, thus improving overall job uptime and performance.
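
The EMR release enables the decommissioning behavior itself, but the job still needs checkpointing and a restart strategy configured for recovery to work. A minimal sketch using standard Flink configuration keys (values are illustrative, and the checkpoint bucket is a placeholder):

# flink-conf.yaml (illustrative values)
execution.checkpointing.interval: 60 s
execution.checkpointing.mode: EXACTLY_ONCE
state.checkpoints.dir: s3://my-bucket/flink-checkpoints/
restart-strategy: exponential-delay
restart-strategy.exponential-delay.max-backoff: 2 min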

For Spark workloads, EMR on EKS runs Spark on Kubernetes, so executor placement is decided by the Kubernetes scheduler. In a mixed-instance EKS node group, you steer placement across On-Demand and Spot nodes with the standard Kubernetes primitives: node labels and node selectors, taints and tolerations, and affinity/anti-affinity rules applied to the Spark driver and executor pods.

When a Spot Instance is reclaimed, the behavior depends on the workload. For Flink streaming jobs, the checkpointing mechanism described above lets the job resume from the last saved state. For Spark jobs, the executor pods on the reclaimed node are terminated; the driver detects the lost executors, requests replacement pods from the Kubernetes scheduler, and re-runs the affected tasks. Shuffle and cached data that lived on the reclaimed node must be recomputed unless it was migrated off beforehand, which is what Spark's graceful decommissioning addresses (sketched below).
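
If the job runs a Spark release with graceful decommissioning (upstream Spark 3.1+), executors on a node that has received an interruption notice can migrate shuffle and cached RDD blocks before the node disappears, avoiding recomputation. A minimal sketch of the relevant submit-time properties (the S3 fallback path is a placeholder):

--conf spark.decommission.enabled=true
--conf spark.storage.decommission.enabled=true
--conf spark.storage.decommission.shuffleBlocks.enabled=true
--conf spark.storage.decommission.rddBlocks.enabled=true
--conf spark.storage.decommission.fallbackStorage.path=s3://my-bucket/spark-fallback/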

To minimize the impact of Spot Instance interruptions, it's recommended to follow best practices for using Spot Instances. This includes designing applications to be fault-tolerant and leveraging the two-minute warning provided by Spot Instance interruption notices. Users can create rules in Amazon EventBridge to capture rebalance recommendations and interruption notifications, triggering checkpoints or gracefully handling the interruption.
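
For example, an EventBridge event pattern that matches both the two-minute interruption warning and rebalance recommendations looks like the following (the rule's target, such as a Lambda function that triggers a checkpoint, is up to you):

{
  "source": ["aws.ec2"],
  "detail-type": [
    "EC2 Spot Instance Interruption Warning",
    "EC2 Instance Rebalance Recommendation"
  ]
}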

In the event of a lost executor, Spark and Kubernetes cooperate: Kubernetes schedules replacement executor pods on whatever nodes have capacity, while Spark's internal retry mechanisms (task retries governed by spark.task.maxFailures, plus stage re-attempts that regenerate lost shuffle output) re-run the work that was in flight on the lost executor.

Overall, EMR on EKS provides a robust platform for running big data workloads on Spot Instances, with built-in mechanisms to handle interruptions and ensure job continuity, especially for streaming workloads using Apache Flink.
Sources
Graceful decommission of Spot Instances with Flink on Amazon EMR on EKS - Amazon EMR
Best practices for Amazon EC2 Spot - Amazon Elastic Compute Cloud
Introducing Amazon EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform | AWS Big Data Blog

answered 9 days ago

How does Spark on EMR handle executor placement across On-Demand and Spot nodes in a mixed-instance EKS node group?

Spark on EMR running on EKS relies on Kubernetes' scheduler to place executor pods on available nodes. If your EKS cluster includes both On-Demand and Spot instances (e.g., via separate node groups), Kubernetes makes scheduling decisions based on:

 * Node availability
 * Pod resource requirements (CPU, memory)
 * Node labels and affinity rules (if configured)

By default, Spark executors may land on either On-Demand or Spot nodes unless you explicitly control scheduling using node selectors, taints/tolerations, or affinity/anti-affinity rules. Spark itself is agnostic about the underlying EC2 instance type. It treats each executor pod equally; however, the cost and reliability characteristics of the node are determined by EKS and EC2.
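
As an illustration, EKS managed node groups label their nodes with eks.amazonaws.com/capacityType (ON_DEMAND or SPOT), and Spark 3.3+ supports per-role node selectors, so a common pattern is to pin the driver to On-Demand capacity while allowing executors on Spot. A minimal sketch, assuming those labels are present:

--conf spark.kubernetes.driver.node.selector.eks.amazonaws.com/capacityType=ON_DEMAND
--conf spark.kubernetes.executor.node.selector.eks.amazonaws.com/capacityType=SPOT

On Spark versions before 3.3, spark.kubernetes.node.selector.* applies the same selector to both driver and executors, so separating the two roles has to be done through pod templates instead.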

AWS
answered 9 days ago

What mechanisms are available (e.g., tolerations, taints, node selectors, labels) to control where Spark executors are scheduled?

You can use Kubernetes-native mechanisms to control pod placement:

Node Selectors: Add node labels like spot-instance=true or node-lifecycle=spot, then configure Spark pods to match.

--conf spark.kubernetes.node.selector.node-lifecycle=spot

Taints and Tolerations: Taint your Spot nodes (e.g., spotInstance=true:NoSchedule) so that only pods that explicitly tolerate the taint can be scheduled there. Note that upstream Spark does not expose a spark.kubernetes.tolerations property; tolerations are supplied through a pod template referenced via spark.kubernetes.executor.podTemplateFile, as sketched below.

Affinity/Anti-affinity Rules: affinity is likewise configured in the pod template rather than through a dedicated Spark property; use node affinity to prefer or require certain node types, or pod anti-affinity to spread executors across nodes.

These allow fine-grained control, ensuring that critical components like the Spark driver remain on On-Demand nodes, while executor pods can be scheduled on cheaper Spot nodes.
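
Putting it together on EMR on EKS, these properties are passed at submission time through StartJobRun; a trimmed sketch (cluster ID, role ARN, and S3 paths are placeholders):

aws emr-containers start-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --name spot-aware-job \
  --execution-role-arn <execution-role-arn> \
  --release-label emr-6.15.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://my-bucket/app.py",
      "sparkSubmitParameters": "--conf spark.kubernetes.driver.node.selector.eks.amazonaws.com/capacityType=ON_DEMAND --conf spark.kubernetes.executor.node.selector.eks.amazonaws.com/capacityType=SPOT --conf spark.kubernetes.executor.podTemplateFile=s3://my-bucket/executor-pod-template.yaml"
    }
  }'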

AWS
answered 9 days ago
