Why isn't my EMR cluster scaling even though I have managed scaling turned on or resizing metrics were met?


I have managed scaling turned on or resizing metrics were met on my Amazon EMR cluster, but the cluster isn't scaling.


The following are common reasons why your EMR cluster might not scale even though managed scaling is turned on or resizing metrics were met:

The thresholds set in Amazon CloudWatch metrics for scaling aren't met

Automatic scaling depends on CloudWatch metrics. If the corresponding metrics thresholds aren't met for scaling up or down, then scaling doesn't happen.

Check the Amazon EMR metrics in CloudWatch to verify that the metrics set in your scaling rules are populated. For example, verify that metrics such as ContainerPendingRatio and YARNMemoryAvailablePercentage are populated as defined in your scaling rules.

The following are common reasons that Amazon EMR metrics aren't populating as expected in CloudWatch:

  • The file /etc/hadoop/conf/hadoop-metrics2.properties doesn't exist or is corrupted. For example, the file might have been overwritten by a custom bootstrap action.
  • There might be issues with metrics-related components such as Hadoop, YARN, and so on. Review the corresponding application logs to check for errors.
  • For managed scaling, verify that the MetricsCollector daemon is running by running the sudo systemctl status MetricsCollector command on the primary node.
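As a quick check, you can compare the latest datapoint for one of these metrics against your rule's threshold. The following sketch is illustrative: the 15 percent threshold is an assumption, and a sample datapoint stands in for the live CloudWatch call shown in the comments.

```shell
#!/usr/bin/env bash
# Sketch: would YARNMemoryAvailablePercentage trigger a scale-out rule?
# In practice, fetch the datapoint with a call such as:
#   aws cloudwatch get-metric-statistics \
#     --namespace AWS/ElasticMapReduce \
#     --metric-name YARNMemoryAvailablePercentage \
#     --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
#     --period 300 --statistics Average \
#     --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z
THRESHOLD=15      # illustrative scale-out threshold from your scaling rule
METRIC_VALUE=12   # sample datapoint; parse it from the call above in practice

# Scale-out fires when available YARN memory drops below the threshold
if awk -v v="$METRIC_VALUE" -v t="$THRESHOLD" 'BEGIN { exit !(v < t) }'; then
  echo "scale-out condition met"
else
  echo "scale-out condition not met (check that the metric is populated)"
fi
```

If the metric returns no datapoints at all, that points back to the causes above, such as a missing hadoop-metrics2.properties file or a stopped MetricsCollector daemon.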

You're using applications that aren't YARN-based

Amazon EMR scaling relies on metrics that YARN generates. Applications such as Presto don't run on YARN, so the cluster won't scale even when Presto query utilization is high. If you use applications that aren't YARN-based, then scale manually. For example, you can call the Amazon EMR resize API based on custom Presto metrics.
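A minimal sketch of that approach follows, assuming a queued-query count taken from the Presto coordinator's stats endpoint and a hypothetical task instance group ID and threshold. A sample value stands in for the live call shown in the comments.

```shell
#!/usr/bin/env bash
# Sketch: drive a manual resize from a custom Presto metric.
# In practice, fetch the queued-query count from the coordinator, e.g.:
#   curl -s http://localhost:8889/v1/cluster | jq .queuedQueries
QUEUED_QUERIES=14                 # sample value; fetch it as shown above
TASK_GROUP_ID="ig-XXXXXXXXXXXXX"  # hypothetical task instance group ID
TARGET_COUNT=8                    # illustrative target instance count

# If the queue is deep, resize the task group with the EMR API
if [ "$QUEUED_QUERIES" -gt 10 ]; then
  echo "would run: aws emr modify-instance-groups --instance-groups InstanceGroupId=$TASK_GROUP_ID,InstanceCount=$TARGET_COUNT"
fi
```

The script only prints the resize command; in a real setup you would run the aws command directly, for example from a cron job on the primary node.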

The core or task instance groups are in a suspended or arrested state

Core or task instance groups that are in a suspended or arrested state become stuck when they resize or scale. For troubleshooting steps, see Suspended state.

Reconfigurations cause instance groups to be in an arrested state. For more information, see Troubleshoot instance group reconfiguration.

There are HDFS application issues in EMR causing issues when scaling core nodes

It's a best practice to keep core nodes fixed if the following are true:

  • You store data in Amazon Simple Storage Service (Amazon S3) buckets, and
  • HDFS utilization is at a minimum.

To avoid HDFS issues, scale only the task nodes.

Scaling core nodes takes longer than scaling task nodes because core nodes run an additional service (DataNode) that stores HDFS data, and decommissioning HDFS data takes time. If your use case requires core node scaling and the scaling is stuck, then there might be an issue with HDFS decommissioning. To troubleshoot scaling that's stuck because of HDFS decommissioning, check the following items:

  • Check the health of the HDFS services (NameNode and DataNode).
  • Run the hdfs dfsadmin -report command to check for missing, corrupted, or under-replicated blocks.
  • Check whether any core nodes are unhealthy because of disk, memory, or CPU issues.
  • Check the HDFS replication factor. If the replication factor is set to 2 or 3 and you try to scale down to one core node, then scaling becomes stuck because HDFS must maintain the minimum number of replicas.
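The checks above can be sketched as commands on the primary node, together with the replication-factor guard that explains the stuck scale-down. The replication factor and target node count below are illustrative sample values.

```shell
#!/usr/bin/env bash
# Sketch: HDFS health checks before scaling down core nodes.
# On the primary node you could run:
#   hdfs dfsadmin -report              # missing/corrupt/under-replicated blocks
#   hdfs fsck /                        # per-file block health
#   hdfs getconf -confKey dfs.replication   # the configured replication factor
REPLICATION_FACTOR=3   # sample; read it with hdfs getconf in practice
TARGET_CORE_NODES=1    # illustrative scale-down target

# HDFS can't keep enough replicas if core nodes drop below the factor
if [ "$TARGET_CORE_NODES" -lt "$REPLICATION_FACTOR" ]; then
  echo "scale-down to $TARGET_CORE_NODES core nodes will get stuck:"
  echo "HDFS can't maintain $REPLICATION_FACTOR replicas of each block"
fi
```

In that situation, either lower dfs.replication (and run hdfs fsck to confirm block health afterward) or keep the core node count at or above the replication factor.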

The requested capacity isn't available in Amazon EMR

If the requested Amazon Elastic Compute Cloud (Amazon EC2) capacity isn't available, then scaling fails after the timeout period. If scaling is stuck for a long period (2 to 3 hours) and you see insufficient capacity errors in AWS CloudTrail events, then perform a manual resize.
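A sketch of that check follows. The sample event payload stands in for real CloudTrail output, and the cluster ID is a placeholder; the lookup-events call that would produce the real events is shown in the comments.

```shell
#!/usr/bin/env bash
# Sketch: check recent CloudTrail events for insufficient-capacity errors
# before falling back to a manual resize. In practice, fetch events with:
#   aws cloudtrail lookup-events \
#     --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
#     --max-results 50
# and inspect the errorCode field of each event.
SAMPLE_EVENT='{"errorCode":"Server.InsufficientInstanceCapacity"}'  # fixture

if echo "$SAMPLE_EVENT" | grep -q "InsufficientInstanceCapacity"; then
  echo "insufficient capacity detected: perform a manual resize"
  echo "for example: aws emr modify-instance-groups --cluster-id j-XXXXXXXXXXXXX"
fi
```

A manual resize lets you pick a different instance count, or you can retry with another instance type that has available capacity.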

Related information

Use automatic scaling with a custom policy for instance groups

Manually resize a running cluster

Using managed scaling in Amazon EMR

Top 9 performance tuning tips for PrestoDB on Amazon EMR

AWS OFFICIAL | Updated a year ago