The user cache for my Apache Hadoop or Apache Spark job uses all the disk space on the partition. The Amazon EMR job fails or the HDFS NameNode service is in safe mode.
Short description
On an Amazon EMR cluster, YARN is configured to allow jobs to write cache data to /mnt/yarn/usercache. When you process a large amount of data or run multiple concurrent jobs, the /mnt file system might fill up. This causes the NodeManager service to fail on some nodes, and the job freezes or fails.
To resolve this issue, use one of the following methods:
- If you don't have long-running or streaming jobs, then adjust the user cache retention settings for YARN NodeManager.
- If you have long-running or streaming jobs, then scale up the Amazon Elastic Block Store (Amazon EBS) volumes.
Resolution
Adjust the user cache retention settings for NodeManager
The following attributes define the cache cleanup settings:
- yarn.nodemanager.localizer.cache.cleanup.interval-ms: This is the cache cleanup interval. The default value is 600,000 milliseconds. After each interval, files that the running containers don't use are deleted if the cache size exceeds the value that's set in yarn.nodemanager.localizer.cache.target-size-mb.
- yarn.nodemanager.localizer.cache.target-size-mb: This is the maximum disk space that's allowed for the cache. The default value is 10,240 MB. When the cache size exceeds this value, files that the running containers don't use are deleted at the interval that's set in yarn.nodemanager.localizer.cache.cleanup.interval-ms.
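The interaction between these two settings can be sketched as a simple model. This is illustrative only; the function name and the eviction order are assumptions, not YARN's actual implementation:

```python
# Illustrative sketch of the cache-cleanup policy described above: at each
# cleanup interval, if the cache exceeds the target size, files that aren't
# referenced by running containers are deleted until the cache fits again.
# The eviction order here (sorted by name) is a placeholder assumption.

def cleanup_cache(files, in_use, target_size_mb):
    """files: dict of file name -> size in MB.
    in_use: set of file names pinned by running containers.
    Returns the set of deleted file names."""
    deleted = set()
    total = sum(files.values())
    for name in sorted(f for f in files if f not in in_use):
        if total <= target_size_mb:
            break
        total -= files[name]
        deleted.add(name)
    return deleted

# With the default 10,240 MB target, a 12,000 MB cache sheds unpinned files:
cache = {"jar_a": 4000, "jar_b": 3000, "data_c": 5000}
print(cleanup_cache(cache, in_use={"data_c"}, target_size_mb=5120))
```

Note that files pinned by running containers are never deleted, which is why lowering these settings alone doesn't help long-running jobs.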
To set the cleanup interval and maximum disk space size on your cluster, complete the following steps:
- Open /etc/hadoop/conf/yarn-site.xml on each core and task node.
- Reduce the values of yarn.nodemanager.localizer.cache.cleanup.interval-ms and yarn.nodemanager.localizer.cache.target-size-mb on each core and task node. For example:

sudo vim /etc/hadoop/conf/yarn-site.xml

<property>
    <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
    <value>400000</value>
</property>
<property>
    <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
    <value>5120</value>
</property>
- Run the following commands on each core and task node to restart NodeManager:

EMR 5.29 and earlier:

sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager

EMR 5.30.0 and later:

sudo systemctl stop hadoop-yarn-nodemanager
sudo systemctl start hadoop-yarn-nodemanager
To set the cleanup interval and maximum disk space size on a new cluster at launch, add a configuration object that's similar to the following one:
[
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.localizer.cache.cleanup.interval-ms": "400000",
            "yarn.nodemanager.localizer.cache.target-size-mb": "5120"
        }
    }
]
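As one possible way to apply this configuration at launch, you can write it to a file and reference the file from the AWS CLI. The cluster name and release label in the comment below are placeholder values:

```python
# Write the yarn-site configuration object shown above to a JSON file that
# `aws emr create-cluster --configurations` can consume.
import json

yarn_config = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.localizer.cache.cleanup.interval-ms": "400000",
            "yarn.nodemanager.localizer.cache.target-size-mb": "5120",
        },
    }
]

with open("yarn-cache-config.json", "w") as f:
    json.dump(yarn_config, f, indent=2)

# Then, for example (placeholder name and release label):
# aws emr create-cluster --name my-cluster --release-label emr-6.10.0 \
#     --configurations file://yarn-cache-config.json ...
```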
The deletion service doesn't remove files that running containers use. This means that even after you adjust the user cache retention settings, data might still spill to the following path and fill the file system:

{yarn.nodemanager.local-dirs}/usercache/user/appcache/application_id
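To check whether appcache data is what's filling /mnt, you can total the file sizes under each application's cache directory. A minimal sketch, where dir_size_bytes is a hypothetical helper and the base path in the comment is the EMR default (run it on a node as a user that can read the tree):

```python
# Sum the sizes of all regular files under a directory tree, e.g. one
# application's appcache directory.
import os

def dir_size_bytes(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

# Example usage on a node (default EMR location):
# base = "/mnt/yarn/usercache"
# for user in os.listdir(base):
#     appcache = os.path.join(base, user, "appcache")
#     if os.path.isdir(appcache):
#         for app in os.listdir(appcache):
#             print(app, dir_size_bytes(os.path.join(appcache, app)))
```

The same check is often done with du -sh /mnt/yarn/usercache/*/appcache/* from a shell.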
Scale up the EBS volumes on the EMR cluster nodes
To scale up storage on a running EMR cluster, see Dynamically scale up storage on Amazon EMR clusters.
To scale up storage on a new EMR cluster, specify a larger volume size when you create the EMR cluster. You can also do this when you add nodes to an existing cluster:
- Amazon EMR versions 5.22.0 and later: The default amount of EBS storage increases based on the size of the Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information about the default amount of storage and number of volumes for each instance type, see Default Amazon EBS storage for instances.
- Amazon EMR versions 5.21 and earlier: The default EBS volume size is 32 GB. 27 GB is reserved for the /mnt partition. HDFS, YARN, the user cache, and all applications use the /mnt partition. Increase the size of your EBS volume as needed. You can also specify multiple EBS volumes that are mounted as /mnt1 or /mnt2.
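For example, here is a hedged sketch of an instance-group definition that requests larger EBS volumes, suitable for passing to aws emr create-cluster --instance-groups file://instance-groups.json. The instance type, counts, volume type, and sizes are placeholder values to adjust for your workload:

```python
# Build an instance-group definition with a larger EBS allocation and write
# it to a file for the AWS CLI. All numeric values below are placeholders.
import json

instance_groups = [
    {
        "Name": "Core",
        "InstanceGroupType": "CORE",
        "InstanceCount": 2,
        "InstanceType": "m5.xlarge",
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {"VolumeType": "gp3", "SizeInGB": 200},
                    "VolumesPerInstance": 2,
                }
            ]
        },
    }
]

with open("instance-groups.json", "w") as f:
    json.dump(instance_groups, f, indent=2)
```

With VolumesPerInstance set to 2, the additional volumes are mounted as separate /mntN file systems, which spreads the user cache and shuffle data across more disk space.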
For Spark streaming jobs, you can also call RDD.unpersist() after you no longer need the data. Or, explicitly call System.gc() in Scala or sc._jvm.System.gc() in Python to start JVM garbage collection and remove the intermediate shuffle files.
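A minimal PySpark sketch of this pattern, assuming an existing SparkSession; this is illustrative, not a complete streaming job, and sc._jvm is an internal API whose behavior can vary by Spark version:

```python
# Release cached data between batches so that old blocks and shuffle files
# don't accumulate on /mnt. Requires pyspark to be installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-cleanup-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000)).cache()
rdd.count()        # materialize the cached RDD

rdd.unpersist()    # release the cached blocks once they're no longer needed

# Optionally nudge JVM garbage collection so intermediate shuffle files are
# cleaned up sooner (internal API; assumption based on the text above):
sc._jvm.System.gc()
```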