My Amazon EMR jobs are stuck in the Accepted state and logs show "WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources".
EMR jobs become stuck in the Accepted state if the cluster doesn't have enough resources to fulfill the job request. This might happen for the following reasons:
- The YARNMemoryAvailablePercentage is very low and many of the containers are pending.
- The application can't start an application master due to insufficient resources on the core nodes. This can occur on Amazon EMR 5.19.0 and later, excluding the Amazon EMR 6.x series.
- The core nodes are unhealthy.
- One EMR job is consuming all the resources.
The cluster has insufficient resources to fulfill the job request
1. Connect to the Resource Manager UI or use the following command from any node to check the resources:
yarn top 10
2. Check if the Used Resources are almost equivalent to the Total Resources. You can also check the Amazon CloudWatch metrics for the YARNMemoryAvailablePercentage and MemoryAvailableMB.
3. If needed, add more capacity to the cluster. You can use EMR Managed Scaling or automatic scaling to automatically add or remove capacity based on resource utilization.
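The check in step 2 can be sketched in Python. This is a minimal illustration, not an official tool: it assumes you have already fetched the `clusterMetrics` object from the YARN ResourceManager REST endpoint (`/ws/v1/cluster/metrics`) and parsed it into a dict. The 5% cutoff and the sample values are assumptions chosen for illustration.

```python
def yarn_memory_available_percentage(cluster_metrics: dict) -> float:
    """Return available YARN memory as a percentage of total YARN memory."""
    total_mb = cluster_metrics["totalMB"]
    if total_mb <= 0:
        return 0.0
    return 100.0 * cluster_metrics["availableMB"] / total_mb


def looks_resource_starved(cluster_metrics: dict, min_available_pct: float = 5.0) -> bool:
    """True when little memory is free AND containers are waiting.

    min_available_pct is an illustrative cutoff, not an EMR default.
    """
    pct = yarn_memory_available_percentage(cluster_metrics)
    pending = cluster_metrics.get("containersPending", 0)
    return pct < min_available_pct and pending > 0


# Hypothetical sample shaped like the "clusterMetrics" object.
sample = {"totalMB": 516096, "availableMB": 2400, "containersPending": 12}

if looks_resource_starved(sample):
    print("Cluster memory is nearly exhausted and containers are pending.")
```

If the function returns True for your cluster, the Used Resources are close to the Total Resources and adding capacity is likely needed.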
There are insufficient resources on the core nodes
On EMR 5.19.0 and later, excluding the 6.x series, the application master runs on the core node by default. In EMR 6.x series, the application master can run on both the core and task nodes.
When many jobs are submitted and the cluster has only a few core nodes, the core nodes might not be able to allocate another application master container. The job can then become stuck even though the task nodes have enough memory. If this occurs, you might see the following message in the container logs:
Application is Activated, waiting for resources to be assigned for AM. Last Node which was processed for the application : ip-xxxx:8041 ( Partition : , Total resource : <memory:516096, vCores:64>, Available resource : <memory:516096,vCores:64> ). Details : AM Partition = CORE ; Partition Resource = <memory:516096, vCores:64> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 99.53497 % ; Queue's Absolute max capacity =100.0 %
If this occurs, terminate jobs to free some resources. Or, add more core nodes to the cluster.
Additionally, you can turn off YARN labels in Amazon EMR 5.x.
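To spot this condition in collected logs, you could extract the application master partition and the queue usage from the diagnostic message. The following is a rough sketch: the function name and regular expressions are hypothetical and assume the message keeps the wording shown above.

```python
import re

# Patterns assume the diagnostic wording quoted in this article.
AM_PARTITION_RE = re.compile(r"AM Partition\s*=\s*(\S+)")
USED_CAPACITY_RE = re.compile(r"Queue's Absolute used capacity\s*=\s*([\d.]+)\s*%")


def parse_am_diagnostics(message: str) -> dict:
    """Pull the AM partition and the queue's used capacity out of the message."""
    partition = AM_PARTITION_RE.search(message)
    used = USED_CAPACITY_RE.search(message)
    return {
        "am_partition": partition.group(1) if partition else None,
        "used_capacity_pct": float(used.group(1)) if used else None,
    }


# Excerpt of the diagnostic message shown above.
sample = (
    "Details : AM Partition = CORE ; Partition Resource = "
    "<memory:516096, vCores:64> ; Queue's Absolute capacity = 100.0 % ; "
    "Queue's Absolute used capacity = 99.53497 % ; "
    "Queue's Absolute max capacity =100.0 %"
)

info = parse_am_diagnostics(sample)
print(info)
```

An `am_partition` of CORE combined with a used capacity near 100% suggests that the core nodes, rather than the task nodes, are the bottleneck.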
The core nodes are unhealthy
If the core nodes run out of disk space and the mount point has disk utilization over 90%, then Amazon EMR considers the node unhealthy. New containers aren't scheduled on unhealthy nodes. If this occurs, the following message appears in the primary instance's controller logs. The logs are located at /emr/instance-controller/log.
Yarn unhealthy Reason : 1/4 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]
To correct unhealthy nodes, reduce disk usage by removing old container logs or Spark event logs. You can also dynamically scale storage based on disk utilization.
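Before cleaning up, it can help to confirm which directories are over the threshold. The sketch below approximates the check with Python's standard library. The paths come from the log message above, the 90% threshold matches the value in that message, and the function names are hypothetical. YARN's own disk health checker may compute usage slightly differently, so treat the result as an estimate.

```python
import shutil


def disk_used_percent(path: str) -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def exceeds_threshold(used_pct: float, threshold_pct: float = 90.0) -> bool:
    """True when utilization is above the threshold quoted in the log message."""
    return used_pct > threshold_pct


def report(paths) -> None:
    """Print utilization for each path, skipping paths that don't exist."""
    for path in paths:
        try:
            pct = disk_used_percent(path)
        except OSError as err:
            print(f"{path}: cannot read disk usage ({err})")
            continue
        flag = "OVER" if exceeds_threshold(pct) else "ok"
        print(f"{path}: {pct:.1f}% used [{flag}]")


if __name__ == "__main__":
    # Directories named in the unhealthy-node message above.
    report(["/mnt/yarn", "/var/log/hadoop-yarn/containers"])
```

Run the script on each core node. Any path flagged as OVER is a candidate for log cleanup or for additional storage.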
One job is consuming all the resources or Spark parameters are configured beyond the cluster limits
Spark Dynamic Allocation is turned on by default in Amazon EMR. If a Spark job isn't properly configured, then the job might consume all the cluster's resources. For example, this can happen if the Max Executors (spark.dynamicAllocation.maxExecutors) limit is unset or set too high, because a single job can then keep requesting executors until the cluster is full. For Spark jobs, tune the memory constraints to avoid one job consuming all of the cluster's resources.
Job acceptance fails if the executor memory or the driver memory is more than the YARN-configured limits. These limits are set by yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb. If this occurs, you see an error message similar to the following:
22/01/03 20:05:05 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (49152),overhead (6144 MB), and PySpark memory (0 MB) is above the max threshold (24576 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
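The arithmetic behind this error can be reproduced before you submit a job. The following sketch assumes the check is a simple sum of executor memory, overhead, and PySpark memory compared against the per-container maximum, as the message suggests. The function names are hypothetical, and the fallback overhead (10% of executor memory, with a 384 MB minimum) is Spark's documented default rather than a value taken from this cluster.

```python
def default_overhead_mb(executor_memory_mb: int) -> int:
    """Spark's documented default: 10% of executor memory, at least 384 MB."""
    return max(384, int(executor_memory_mb * 0.10))


def exceeds_container_limit(
    executor_memory_mb: int,
    max_allocation_mb: int,
    overhead_mb=None,
    pyspark_memory_mb: int = 0,
) -> bool:
    """True when the requested container is larger than YARN allows."""
    if overhead_mb is None:
        # Assumption: no explicit overhead was configured for the job.
        overhead_mb = default_overhead_mb(executor_memory_mb)
    required_mb = executor_memory_mb + overhead_mb + pyspark_memory_mb
    return required_mb > max_allocation_mb


# Values taken from the error message above.
too_big = exceeds_container_limit(
    executor_memory_mb=49152,
    max_allocation_mb=24576,
    overhead_mb=6144,
)
print("Request exceeds the container limit:", too_big)
```

With the values from the message, the request totals 55296 MB against a 24576 MB limit, so the job is rejected.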
To resolve this, do the following: