I submitted an Apache Spark application to an Amazon EMR cluster, and the application fails with an "ExecutorLostFailure: Slave lost" error.
Resolution
When a Spark task fails because a node terminates or becomes unavailable, you might receive the following error:
"Most recent failure: Lost task 1209.0 in stage 4.0 (TID 31219, ip-###-###-##-###.compute.internal, executor 115): ExecutorLostFailure (executor 115 exited caused by one of the running tasks) Reason: Slave lost"
The following are some of the reasons you might receive this error.
High disk utilization because of an unhealthy node
In Apache Hadoop, NodeManager periodically checks the Amazon Elastic Block Store (Amazon EBS) volumes that are attached to the cluster's nodes. If disk utilization on a node with a single volume exceeds the value of the YARN property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (90% by default), then NodeManager considers the node unhealthy. ResourceManager then shuts down all containers on that node and doesn't schedule new containers on it. For more information, see NodeManager on the Apache Hadoop website.
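At its core, this health check is a per-volume utilization comparison against a threshold. The following standard-library sketch mirrors that comparison; the path and the 90% threshold are illustrative assumptions, so confirm the configured value in yarn-site.xml on your cluster:

```python
import shutil

# Illustrative default for
# yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage;
# confirm the actual value in yarn-site.xml.
THRESHOLD_PERCENT = 90.0

def disk_utilization_percent(path):
    """Percent of capacity used on the volume that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def is_unhealthy(path, threshold=THRESHOLD_PERCENT):
    """Mirror the NodeManager check: unhealthy when utilization exceeds the threshold."""
    return disk_utilization_percent(path) > threshold

# On an EMR node you would check the YARN local and log directories, such as /mnt/yarn.
print(is_unhealthy("/"))
```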
After ResourceManager shuts down multiple executors because of unhealthy nodes, the application fails with an ExecutorLostFailure: Slave lost error. To confirm that a node is unhealthy, review the NodeManager logs or the instance controller logs. The YARN_LOG_DIR variable in yarn-env.sh defines the location of the NodeManager logs. The instance controller logs are stored at /emr/instance-controller/log/instance-controller.log on the primary node. The instance controller logs provide an aggregated view of all the nodes of the cluster.
An unhealthy node shows a log entry that looks similar to the following:
2019-10-04 11:09:37,163 INFO Poller: InstanceJointStatusMap contains 40 entries (R:40): i-006ba###### 1829s R 1817s ig-3B ip-###-##-##-### I: 7s Y:U 11s c: 0 am: 0 H:R 0.0%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
i-00979###### 1828s R 1817s ig-3B ip-###-##-##-### I: 7s Y:R 9s c: 3 am: 2048 H:R 0.0%
i-016c4###### 1834s R 1817s ig-3B ip-###-##-##-### I: 13s Y:R 14s c: 3 am: 2048 H:R 0.0%
i-01be3###### 1832s R 1817s ig-3B ip-###-##-##-### I: 10s Y:U 12s c: 0 am: 0 H:R 0.0%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
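On a large cluster these logs contain many entries, so it can help to filter for the "Yarn unhealthy" marker. The following sketch assumes the entry layout shown above (instance ID first, reason after the marker) and uses anonymized sample data:

```python
import re

# Anonymized sample entries in the instance-controller format shown above.
LOG = """\
i-006ba 1829s R 1817s ig-3B ip-10-0-0-1 I: 7s Y:U 11s c: 0 am: 0 H:R 0.0%Yarn unhealthy Reason : 1/1 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
i-00979 1828s R 1817s ig-3B ip-10-0-0-2 I: 7s Y:R 9s c: 3 am: 2048 H:R 0.0%
"""

def unhealthy_nodes(log_text):
    """Return (instance_id, reason) for every entry flagged 'Yarn unhealthy'."""
    results = []
    for line in log_text.splitlines():
        if "Yarn unhealthy" in line:
            instance_id = line.split()[0]
            match = re.search(r"Reason\s*:\s*(.*)", line)
            reason = match.group(1) if match else ""
            results.append((instance_id, reason))
    return results

for instance_id, reason in unhealthy_nodes(LOG):
    print(instance_id, "->", reason)
```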
To resolve this issue, increase the size of the EBS volumes that are attached to the core and task nodes. Or, delete unused data from the Hadoop Distributed File System (HDFS).
Spot Instances
If you use Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for Amazon EMR cluster nodes and one of the instances terminates, then you might receive an ExecutorLostFailure: Slave lost error. Spot Instances might terminate for the following reasons:
- The Spot Instance price is greater than your maximum price.
- There aren't enough unused EC2 instances available to meet the demand for Spot Instances.
For more information, see Spot Instance interruptions.
To resolve this issue, use On-Demand Instances. Or, if you use Amazon EMR release version 5.11.0 or earlier, then upgrade to the latest version.
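For example, if you create the cluster with the AWS CLI, you can request On-Demand capacity with an instance-groups definition similar to the following and pass it with aws emr create-cluster --instance-groups file://instance-groups.json. The instance types and counts are illustrative:

```json
[
  { "InstanceGroupType": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1, "Market": "ON_DEMAND" },
  { "InstanceGroupType": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "ON_DEMAND" },
  { "InstanceGroupType": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "ON_DEMAND" }
]
```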
Amazon EC2 Auto Scaling policies
During frequent Auto Scaling events, a new node might receive an IP address that a previous node used. If a Spark application runs during a scale-in event, then Spark adds the decommissioned node to its deny list so that no executor launches on that node. If a later scale-out event gives a new node the same IP address as the decommissioned node, then YARN considers the new node valid and attempts to schedule executors on it. Because the node remains on the Spark deny list, executor launches on that node fail. When the number of failures reaches the maximum, the Spark application fails with an ExecutorLostFailure: Slave lost error.
To resolve this issue, decrease the Spark and YARN timeout properties so that nodes leave the deny list sooner. Complete the following steps:
- Add the following parameter to the /etc/spark/conf/spark-defaults.conf file:
spark.blacklist.decommissioning.timeout 600s
Note: This reduces the time that a node in the decommissioning state remains on the deny list. The default is one hour. For more information, see Configuring node decommissioning behavior.
- Modify the following YARN property in /etc/hadoop/conf/yarn-site.xml:
<property>
  <name>yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs</name>
  <value>600</value>
</property>
Note: This property specifies the wait time for running containers and applications to complete before a decommissioning node transitions to the decommissioned state. The default is 3,600 seconds.
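Instead of editing these files on each node, you can also apply both settings at cluster creation with an Amazon EMR configuration object. The following fragment uses the spark-defaults and yarn-site classifications to set the same two properties; the 600-second values match the steps above:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.blacklist.decommissioning.timeout": "600s"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs": "600"
    }
  }
]
```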
For more information, see Spark enhancements for elasticity and resiliency on Amazon EMR.
Related information
Configuring Amazon EMR cluster instance types and best practices for Spot Instances
How do I troubleshoot stage failures in Spark jobs on Amazon EMR?