
Why are my YARN applications in Amazon EMR stuck in the Accepted state?


My Amazon EMR jobs are stuck in the Accepted state and logs show a "WARN YarnScheduler" message.

Short description

When the cluster doesn't have enough resources to fulfill the job request, Amazon EMR jobs are stuck in the Accepted state and you see the following message in the logs:

"WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"

Resolution

The cluster has insufficient resources

Your YARN applications in Amazon EMR might be stuck in the Accepted state when the YARNMemoryAvailablePercentage is very low and many containers are pending.

To resolve this issue, complete the following steps:

  1. Connect to the Resource Manager UI, or run the following command from any node to check the cluster's resources:

    yarn top 10

  2. Check whether Used Resources is almost equal to Total Resources. You can also check the YARNMemoryAvailablePercentage and MemoryAvailableMB Amazon CloudWatch metrics.

  3. If the cluster has insufficient resources to fulfill the job request, then add more capacity to the cluster. You can use Amazon EMR managed scaling or automatic scaling to automatically increase or decrease capacity based on resource utilization.
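The decision in the steps above can be sketched as a simple check on the metrics that the article names. This is a minimal illustration, assuming a hypothetical 10% free-memory threshold and a pending-container count that you would read from the Resource Manager or `yarn top`:

```python
# Minimal sketch: decide whether a cluster is resource-starved based on the
# YARNMemoryAvailablePercentage and MemoryAvailableMB CloudWatch metrics.
# The 10% threshold is an assumption, not an EMR default.

def needs_more_capacity(yarn_memory_available_percentage: float,
                        memory_available_mb: float,
                        pending_containers: int,
                        min_free_pct: float = 10.0) -> bool:
    """Return True when available YARN memory is very low and containers wait."""
    return (yarn_memory_available_percentage < min_free_pct
            and pending_containers > 0)

# Example: 3% free memory with 25 pending containers -> scale out
print(needs_more_capacity(3.0, 2048, 25))   # -> True
print(needs_more_capacity(45.0, 65536, 0))  # -> False
```

In practice, Amazon EMR managed scaling makes the same kind of decision for you by watching these metrics continuously.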

The core nodes have insufficient resources

On Amazon EMR versions 5.19.0 and later, excluding the 6.x and 7.x series, the application master runs on a core node by default. On Amazon EMR 6.x and 7.x, the application master can run on both core and task nodes.

When the number of submitted jobs increases and there are only a few core nodes, the core nodes can't allocate another application master container. So a job might get stuck even when the task nodes have enough memory. If this occurs, then you might see the following message in the container logs:

"Application is Activated, waiting for resources to be assigned for AM. Last Node which was processed for the application : ip-####:8041 ( Partition : [], Total resource :<memory:516096, vCores:64>, Available resource : memory:516096,vCores:64 ). Details : AM Partition = CORE ; Partition Resource = <memory:516096, vCores:64> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 99.53497 % ; Queue's Absolute max capacity 100 %"

If you see the preceding message, then terminate jobs to free some resources. Or, add more core nodes to the cluster.

You can also turn off YARN node labels in Amazon EMR version 5 so that application master containers aren't restricted to core nodes.
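A sketch of the configuration change follows. It builds the yarn-site classification object you would pass when creating or reconfiguring the cluster; the property names are the standard YARN node-label settings, but verify them against the Amazon EMR documentation for your release before use:

```python
import json

# Sketch of an EMR configuration that turns off YARN node labels on an
# EMR 5.x cluster via the yarn-site classification. Property names are
# assumptions to verify against the EMR release documentation.
configurations = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.node-labels.enabled": "false",
            "yarn.node-labels.am.default-node-label-expression": "",
        },
    }
]

print(json.dumps(configurations, indent=2))
```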

The core nodes are unhealthy

If a core node runs out of disk space and a mount point's disk utilization exceeds 90%, then Amazon EMR considers the node unhealthy. Amazon EMR doesn't schedule new containers on unhealthy nodes. If the core nodes are unhealthy, then you see the following message in the primary instance's controller logs:

"Yarn unhealthy Reason : 1/4 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]"

Note: Logs are located at /emr/instance-controller/log.

To resolve this issue, remove old container logs or Apache Spark event logs to reduce disk usage. You can also dynamically scale up storage based on disk utilization.
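The 90% threshold quoted in the log message is straightforward to monitor yourself. Here is a minimal sketch that computes utilization for a mount point; the mount path is an example, and on an EMR core node you would point it at /mnt/yarn and /var/log/hadoop-yarn/containers:

```python
import shutil

# Sketch: flag mount points whose utilization crosses the 90% disk
# threshold quoted in the unhealthy-node log message above.

def disk_utilization_pct(path: str) -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def is_over_threshold(path: str, threshold: float = 90.0) -> bool:
    return disk_utilization_pct(path) >= threshold

# Example mount list; replace with /mnt/yarn and the YARN log dirs on EMR.
for mount in ["/"]:
    pct = disk_utilization_pct(mount)
    print(f"{mount}: {pct:.1f}% used, over threshold: {is_over_threshold(mount)}")
```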

One Amazon EMR job consumes all the resources

By default, Amazon EMR turns on Spark dynamic allocation. If you don't properly configure a Spark job, then the job might consume all of the cluster's resources. For example, when the Max Executors limit (spark.dynamicAllocation.maxExecutors) isn't set, a single job can claim all available executors while other jobs wait in the Accepted state.

For Spark jobs, tune the memory constraints to make sure that one job doesn't consume all of the cluster's resources.

Job acceptance fails when the executor memory or the driver memory request exceeds the YARN-configured parameters yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb. The job fails with the following error message:

"22/01/03 20:05:05 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container) Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (49152),overhead (6144 MB), and PySpark memory (0 MB) is above the max threshold (24576 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'."

To resolve the preceding error, take one or more of the following actions:

  1. Reduce spark.executor.memory or spark.driver.memory so that the requested memory, including overhead, stays below yarn.scheduler.maximum-allocation-mb.

  2. Increase yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb, or use larger instance types that support bigger containers.
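The check behind the quoted error can be sketched as follows. The overhead formula, max(384 MB, 10% of executor memory), is Spark's documented default; the exact overhead in your logs may differ if spark.executor.memoryOverhead is set explicitly, as it appears to be in the message above:

```python
# Sketch of YARN's container-size check: a request is rejected when
# executor memory plus overhead exceeds yarn.scheduler.maximum-allocation-mb.

def fits_in_yarn(executor_memory_mb, max_allocation_mb, overhead_mb=None):
    """Return True when the request fits in a single YARN container."""
    if overhead_mb is None:
        # Spark's default overhead: max(384 MB, 10% of executor memory).
        overhead_mb = max(384, int(0.10 * executor_memory_mb))
    return executor_memory_mb + overhead_mb <= max_allocation_mb

# The values from the quoted error: 49152 MB requested plus 6144 MB
# overhead against a 24576 MB maximum allocation -> rejected.
print(fits_in_yarn(49152, 24576, overhead_mb=6144))  # -> False
print(fits_in_yarn(20480, 24576))                    # -> True
```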

Related information

AWS Open Data Analytics

AWS OFFICIAL. Updated 4 months ago.