My Apache Spark job in Amazon EMR fails, and I receive the "Container killed on request. Exit code is 137" error.
Short description
When a container runs out of memory, YARN automatically stops the container and you might receive the following error message:
"Container killed on request" stage failure: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 4 times, most recent failure: Lost task 2.3 in stage 3.0 (TID 23, ip-###-###-##-###.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container marked as failed: container_1516900607498_6585_01_000008 on host: ip-###-###-##-###.compute.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137"
Resolution
Increase driver or container memory
To increase the container memory for a running cluster or a single job, first connect to the primary node of your cluster. Then, set the spark.executor.memory or spark.driver.memory parameter, either in your spark-defaults.conf Spark configuration file or as spark-submit options.
Running cluster
To open spark-defaults.conf, run the following command:
sudo vim /etc/spark/conf/spark-defaults.conf
To increase the container memory, add or modify the spark.executor.memory or spark.driver.memory parameter in spark-defaults.conf:
spark.executor.memory 10g
spark.driver.memory 10g
Note: Replace 10g with a value that's appropriate for your cluster's available resources and workload requirements.
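When you choose a value, it can help to remember that YARN allocates more than spark.executor.memory for each executor, because Spark adds a memory overhead to the container request. The following is a minimal sketch of that arithmetic, assuming spark.executor.memoryOverhead is unset so that Spark's default applies (10% of executor memory, with a 384 MiB minimum). The function name container_request_mb is hypothetical, and YARN might round the request up further to its allocation increment.

```shell
# Estimate the YARN container request (in MiB) for one executor.
# Assumes Spark's defaults: overhead = max(384 MiB, 10% of executor memory).
container_request_mb() {
  executor_mb="$1"
  overhead_mb=$(( executor_mb / 10 ))
  if [ "$overhead_mb" -lt 384 ]; then
    overhead_mb=384
  fi
  echo $(( executor_mb + overhead_mb ))
}

# Example: spark.executor.memory 10g (10240 MiB)
container_request_mb 10240   # prints 11264
```

The result must fit within the memory that YARN can give to a single container on your nodes. Otherwise, the executor can't be scheduled.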
Single job
To increase memory, use the --executor-memory or --driver-memory option when you run the following spark-submit command:
spark-submit \
  --executor-memory 10g \
  --driver-memory 10g \
  ...
Note: Replace 10g with a value that's appropriate for your cluster's available resources and workload requirements.
You can also set maximizeResourceAllocation to true in your spark configuration classification.
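As an illustration, the classification might look like the following. This sketch writes it to a local file. The file name spark-config.json is an arbitrary choice, and the file is meant to be supplied as the cluster configuration when you create the cluster.

```shell
# Write a "spark" configuration classification that turns on
# maximizeResourceAllocation. The file name is an arbitrary example.
cat > spark-config.json <<'EOF'
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
EOF
```

For example, you can pass the file to the aws emr create-cluster command with the --configurations file://spark-config.json option.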
Add more Spark partitions
If you can't increase container memory, then increase the number of Spark partitions to reduce the amount of data that each task processes and the memory that each task uses.
To add more Spark partitions, first connect to the primary node of your cluster. Then, run the following commands in Spark shell, where df is your existing DataFrame:
val numPartitions = 500
val newDF = df.repartition(numPartitions)
Note: Replace 500 with the number of partitions that fit your data size.
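One way to pick a starting value is to divide the total input size by a target size for each partition. The following is a rough sketch of that calculation. The function name estimate_partitions is hypothetical, and the 128 MiB target in the example is a commonly used rule of thumb rather than a value from this article, so treat the result as a first guess to tune.

```shell
# Estimate a partition count by ceiling division.
# Arguments: total input size in MiB, target partition size in MiB.
estimate_partitions() {
  total_mb="$1"
  target_mb="$2"
  echo $(( (total_mb + target_mb - 1) / target_mb ))
}

# Example: 64000 MiB of input at 128 MiB per partition
estimate_partitions 64000 128   # prints 500
```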
Increase the number of shuffle partitions
If the issue occurs during a wide transformation, such as a join or groupBy, then connect to the primary node of your cluster and add more shuffle partitions. The default value of spark.sql.shuffle.partitions is 200.
Running cluster
To open spark-defaults.conf, run the following command:
sudo vim /etc/spark/conf/spark-defaults.conf
To add shuffle partitions, add or modify the spark.sql.shuffle.partitions parameter in spark-defaults.conf:
spark.sql.shuffle.partitions 500
Note: Replace 500 with the number of partitions that fit your data size.
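If you prefer not to edit the file interactively, the same change can be scripted. The following is a minimal sketch, assuming a hypothetical helper named set_spark_conf and a file that uses whitespace-separated "key value" lines. It replaces an existing uncommented entry or appends a new one. The demonstration runs against a scratch file, not the live configuration.

```shell
# Hypothetical helper: set KEY to VALUE in a spark-defaults-style file,
# replacing an existing uncommented entry or appending a new one.
set_spark_conf() {
  conf_file="$1"
  key="$2"
  value="$3"
  tmp_file="$(mktemp)"
  awk -v k="$key" -v v="$value" '
    $1 == k { if (!done) print k " " v; done = 1; next }
    { print }
    END { if (!done) print k " " v }
  ' "$conf_file" > "$tmp_file" && cat "$tmp_file" > "$conf_file"
  rm -f "$tmp_file"
}

# Demonstration on a scratch file rather than the live configuration
demo_conf="$(mktemp)"
echo "spark.sql.shuffle.partitions 200" > "$demo_conf"
set_spark_conf "$demo_conf" spark.sql.shuffle.partitions 500
cat "$demo_conf"   # prints: spark.sql.shuffle.partitions 500
```

To apply the change to /etc/spark/conf/spark-defaults.conf, you need write permission to that file, for example by running the commands as the root user.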
Single job
To add shuffle partitions, use the --conf spark.sql.shuffle.partitions option when you run spark-submit:
spark-submit \
  --conf spark.sql.shuffle.partitions=500 \
  ...
Note: Replace 500 with the number of partitions that fit your data size.
Reduce the number of executor cores
When you reduce the number of executor cores, you also reduce the maximum number of tasks that the executor processes simultaneously. This reduces the amount of memory that the container uses. To reduce the number of executor cores, first connect to the primary node of your cluster and then modify your executor cores parameter.
Running cluster
Open the spark-defaults.conf file on the primary node:
sudo vim /etc/spark/conf/spark-defaults.conf
To reduce the number of executor cores, modify the spark.executor.cores parameter:
spark.executor.cores 1
Note: Replace 1 with the minimum number of executor cores that you require.
Single job
To reduce the number of executor cores, use the --executor-cores option when you run spark-submit:
spark-submit \
  --executor-cores 1 \
  ...
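To see why fewer cores can help, note that the tasks that run at the same time in one executor share that executor's memory. The following is a rough sketch of the per-task share, assuming one task for each core (spark.task.cpus at its default of 1). The function name memory_per_task_mb is hypothetical, and the estimate ignores the memory that Spark reserves internally.

```shell
# Rough per-task share of executor memory (MiB): memory / concurrent tasks.
# Assumes one task per core (spark.task.cpus left at its default of 1).
memory_per_task_mb() {
  executor_mb="$1"
  executor_cores="$2"
  echo $(( executor_mb / executor_cores ))
}

memory_per_task_mb 10240 4   # prints 2560
memory_per_task_mb 10240 1   # prints 10240
```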
Increase instance size
When the operating system (OS) runs out of memory, the OS oom_reaper might also stop the YARN containers. If oom_reaper caused the error, then use a larger Amazon Elastic Compute Cloud (Amazon EC2) instance with more RAM. To make sure that YARN containers don't use all the RAM, decrease yarn.nodemanager.resource.memory-mb.
To determine whether oom_reaper caused the error, review the dmesg command output in your Amazon EMR instance state logs. First, use the YARN ResourceManager UI or logs to find the core or task node where the stopped YARN container ran. Then, check the Amazon EMR instance state logs on that node from before and after the container stopped.
In the following example, the kernel stops the process that has the ID 36787 and corresponds to YARN container_165487060318_0001_01_000244:
# hows the kernel looking
dmesg | tail -n 25
[ 3910.032284] Out of memory: Kill process 36787 (java) score 96 or sacrifice child
[ 3910.043627] Killed process 36787 (java) total-vm:15864568kB, anon-rss:13876204kB, file-rss:0kB, shmem-rss:0kB
[ 3910.748373] oom_reaper: reaped process 36787 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
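To scan a longer log for the same pattern, you can filter for the "Killed process" lines. The following is a minimal sketch, assuming log lines in the format shown in the preceding example. The function name oom_killed_pids is hypothetical.

```shell
# Print the PID of each process that the kernel OOM killer stopped.
# Reads dmesg-style lines on standard input.
oom_killed_pids() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) .*/\1/p'
}

# Example with a line in the format shown in the preceding output
echo '[ 3910.043627] Killed process 36787 (java) total-vm:15864568kB, anon-rss:13876204kB, file-rss:0kB, shmem-rss:0kB' | oom_killed_pids   # prints 36787
```

On a node, you can pipe the dmesg output or a saved instance state log into the function, and then match the process IDs against the YARN container processes.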
Check for disk utilization and node degradation
If the preceding troubleshooting options don't resolve the issue, then review the df -h command output in the instance state logs to check disk utilization and look for signs of node degradation. Also check the node condition on the AWS Health Dashboard.
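To pick out nearly full volumes from the df output, you can filter on the usage column. The following is a minimal sketch, assuming the standard six-column df layout with no spaces in the file system names. The function name full_mounts is hypothetical, the sample rows are invented for illustration, and the threshold of 90 is an assumption based on the default disk utilization limit that the YARN NodeManager uses for its disk health check.

```shell
# List mount points at or above a usage threshold (percent).
# Reads df output on standard input; skips the header row.
full_mounts() {
  threshold="$1"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t + 0) print $6, $5
  }'
}

# Example with invented sample rows
printf '%s\n' \
  'Filesystem 1024-blocks Used Available Capacity Mounted on' \
  '/dev/xvda1 10473452 9635576 837876 93% /' \
  '/dev/xvdb1 5232640 1046528 4186112 20% /emr' | full_mounts 90   # prints: / 93%
```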
Related information
How do I resolve the error "Container killed by YARN for exceeding memory limits" in Spark on Amazon EMR?
How do I troubleshoot stage failures in Spark jobs on Amazon EMR?