I want to troubleshoot stage failures in Apache Spark applications on Amazon EMR.
Short description
You might receive stage failures when a Spark task fails. Stage failures can be caused by hardware problems, incorrect Spark configurations, or code issues. When a stage failure occurs, the Spark driver logs report an exception that's similar to the following:
"org.apache.spark.SparkException: Job aborted due to stage failure: Task XXX in stage YYY failed 4 times, most recent failure: Lost task XXX in stage YYY (TID ZZZ, ip-xxx-xx-x-xxx.compute.internal, executor NNN): ExecutorLostFailure (executor NNN exited caused by one of the running tasks) Reason: (example-reason)"
Resolution
Identify the reason code for Spark jobs that you submit with --deploy-mode client
The reason code is located in the exception that's shown in the terminal.
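For example, the following command submits a Spark application in client mode so that the exception, including the reason code, appears directly in the terminal. The application file example-app.py is a placeholder for your own application:

spark-submit --deploy-mode client example-app.py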
If you submit the job from Amazon EMR Steps, then the reason code is located in the stderr file on the Amazon EMR console. You can also get the step stderr logs from the Amazon Simple Storage Service (Amazon S3) location that you specified for cluster logging. For example, you can use the s3://example-log-bucket/example-cluster-id/steps/example-step-id/ file path to find the logs.
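For example, you can use the AWS Command Line Interface (AWS CLI) to download and search the step logs. The stderr.gz file name reflects the typical compressed step log layout and might differ on your cluster:

aws s3 cp s3://example-log-bucket/example-cluster-id/steps/example-step-id/stderr.gz .
zcat stderr.gz | grep "Job aborted due to stage failure" -A 10

Note: Replace example-log-bucket, example-cluster-id, and example-step-id with your values.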
To identify stage failures in the YARN application logs, run the following command on the primary node:
yarn logs -applicationId example-application-id | grep "Job aborted due to stage failure" -A 10
Note: Replace example-application-id with your Spark application ID.
You can get the YARN application logs from the Amazon S3 location that you specified for cluster logging. For example, you can use the s3://example-log-bucket/example-cluster-id/containers/example-application-id/ file path. You can also get the YARN application logs from the YARN ResourceManager in the application's primary container.
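For example, the following AWS CLI commands download the container logs and then search them for the stage failure message. The local directory name example-containers is a placeholder, and the commands assume that the logs in Amazon S3 are compressed (.gz) files:

aws s3 sync s3://example-log-bucket/example-cluster-id/containers/example-application-id/ example-containers/
find example-containers -name "*.gz" -exec zgrep -H "Job aborted due to stage failure" {} +

Note: Replace example-log-bucket, example-cluster-id, and example-application-id with your values.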
Resolve the root cause
After you identify the exception, use one of the following AWS Knowledge Center articles to resolve the issue: