I want to troubleshoot stage failures in Apache Spark applications on Amazon EMR.
Short description
In Spark, a stage fails when one of its tasks can't be processed. These failures can be caused by hardware issues, incorrect Spark configurations, or code problems. Spark aborts the job after the same task fails spark.task.maxFailures times (4 by default), which is why the message reports "failed 4 times". When a stage failure occurs, the Spark driver logs report an exception similar to the following:
org.apache.spark.SparkException: Job aborted due to stage failure: Task XXX in stage YYY failed 4 times, most recent failure: Lost task XXX in stage YYY (TID ZZZ, ip-xxx-xx-x-xxx.compute.internal, executor NNN): ExecutorLostFailure (executor NNN exited caused by one of the running tasks) Reason: ...
Resolution
Find the reason code
For Spark jobs submitted with --deploy-mode client, the reason code is in the exception that's displayed in the terminal.
For Spark jobs submitted with --deploy-mode cluster, run the following command on the master node to find stage failures in the YARN application logs. Replace application_id with the ID of your Spark application (for example, application_1572839353552_0008).
yarn logs -applicationId application_id | grep "Job aborted due to stage failure" -A 10
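If the output is long, it can help to print only the text that follows "Reason:" in each matching line. The following is a minimal sketch, not part of the YARN or Spark tooling: extract_stage_failure_reason is a hypothetical helper name, and it assumes that the reason appears on the same line as the "Job aborted due to stage failure" message, as in the example exception above. Failures that don't include a "Reason:" field (for example, those that report a stack trace instead) produce no output.

```shell
# Hypothetical helper: reads log lines on stdin and prints only the text
# after "Reason: " for lines that report an aborted stage.
extract_stage_failure_reason() {
  # Keep only the stage failure lines, then strip everything up to "Reason: ".
  # Lines without a "Reason: " field are dropped (sed -n with the p flag).
  grep "Job aborted due to stage failure" | sed -n 's/.*Reason: //p'
}

# Example usage (application_id is a placeholder, as in the command above):
# yarn logs -applicationId application_id | extract_stage_failure_reason
```

Keeping the filter as a function that reads standard input means that it also works on a log file that you saved earlier, for example with a redirect from the yarn logs command.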
In cluster mode, the driver runs in the application master container. This means that you can also find the exception in that container's logs through the YARN ResourceManager.
Resolve the root cause
After you find the exception, use one of the following articles to resolve the root cause: