
Why doesn’t my EMR cluster scale even though I have managed scaling turned on or resizing metrics are met?


I have managed scaling turned on or the resizing metrics are met, but my EMR cluster doesn’t scale.

Resolution

The thresholds set in Amazon CloudWatch metrics for scaling aren't met

Automatic scaling depends on Amazon CloudWatch metrics. If the corresponding metrics thresholds aren't met for scaling up or down, then scaling doesn't happen.

Check the Amazon EMR metrics in Amazon CloudWatch to verify that the metrics set in your scaling rules, such as ContainerPendingRatio and YARNMemoryAvailablePercentage, populate.

Amazon EMR metrics might not populate as expected in CloudWatch for one of the following reasons:

  • The /etc/hadoop/conf/hadoop-metrics2.properties file doesn't exist or it's corrupted. For example, a custom bootstrap action might have overwritten the file.
  • There might be issues with metrics-related components such as Hadoop or YARN. Review the corresponding application logs to check for errors.
  • The MetricsCollector daemon might not be running. For managed scaling, run the following command on the primary node to check whether the MetricsCollector daemon is running:
    sudo systemctl status MetricsCollector
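You can combine the configuration file check and a daemon restart in one pass. The following is a minimal sketch that assumes the stock Amazon EMR file paths and the MetricsCollector systemd unit name; verify both on your EMR release:

```shell
# Run on the EMR primary node.
# 1. Confirm the Hadoop metrics config file exists and isn't empty
#    (a custom bootstrap action might have overwritten it).
if [ -s /etc/hadoop/conf/hadoop-metrics2.properties ]; then
    echo "hadoop-metrics2.properties is present"
else
    echo "hadoop-metrics2.properties is missing or empty"
fi

# 2. If the MetricsCollector daemon isn't running, restart it
#    (managed scaling only).
sudo systemctl restart MetricsCollector
```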

You use applications that aren't YARN-based

Amazon EMR scaling relies on metrics that YARN generates. Applications such as Presto that aren't YARN-based don't produce these metrics, so clusters don't scale even when Presto query utilization is high. If you use applications that aren't YARN-based, use manual scaling instead. For example, you can call the Amazon EMR resize API based on custom Presto metrics.
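A minimal AWS CLI sketch of such a manual resize follows. The cluster and instance group IDs are placeholders, and the target count would come from your own Presto utilization metrics:

```shell
# Placeholder IDs: replace j-XXXXXXXXXXXX and ig-XXXXXXXXXXXX with your
# cluster ID and task instance group ID. The new InstanceCount would be
# driven by your own Presto metrics (for example, queued queries or
# worker utilization from Presto's coordinator).
aws emr modify-instance-groups \
    --cluster-id j-XXXXXXXXXXXX \
    --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=8
```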

The core or task instance groups are in a suspended or arrested state

Core or task instance groups that are in a suspended or arrested state get stuck when they try to resize or scale. For more information, see Suspended state.

Reconfigurations might cause instance groups to go into an arrested state. For more information, see Troubleshoot instance group reconfiguration.
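To check whether a group is in one of these states, you can list instance group states with the AWS CLI. The cluster ID below is a placeholder:

```shell
# Lists each instance group's name and state;
# look for SUSPENDED or ARRESTED in the output.
aws emr describe-cluster \
    --cluster-id j-XXXXXXXXXXXX \
    --query "Cluster.InstanceGroups[].{Name:Name,State:Status.State}"
```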

There are HDFS application issues in Amazon EMR that cause issues when you scale core nodes

If both of the following conditions are true, then it's a best practice to keep the number of core nodes fixed:

  • You store data in Amazon Simple Storage Service (Amazon S3) buckets.
  • Hadoop Distributed File System (HDFS) utilization is at a minimum.

Note: It's a best practice to scale task nodes only to avoid HDFS issues.

It takes longer to scale core nodes than to scale task nodes. This is because core nodes run an additional service (the HDFS DataNode) that stores the HDFS data, and decommissioning that data takes time. If your use case requires core node scaling and the scaling is stuck, then there might be an issue with HDFS decommissioning.

To troubleshoot scaling that's stuck because of HDFS decommissioning, take the following actions:

  • Check the HDFS services health (NameNode and DataNode).
  • Run the hdfs dfsadmin -report command to check whether there are any missing, corrupted, or under-replicated blocks.
  • Check whether there are any core nodes that are unhealthy because of disk, memory, or CPU issues.
  • Check the HDFS replication factor. If you try to scale the core nodes down to fewer nodes than the replication factor, for example down to 1 node when the replication factor is set to 2 or 3, then the scaling operation becomes stuck. This is because Amazon EMR must maintain the minimum number of replicas.
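The HDFS checks above can be run as commands on the primary node:

```shell
# Health and block report: live/dead DataNodes, missing or corrupt blocks.
hdfs dfsadmin -report

# File system check; the summary includes under-replicated block counts.
hdfs fsck /

# Current replication factor. Don't scale core nodes below this value.
hdfs getconf -confKey dfs.replication
```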

The requested capacity isn't available in Amazon EMR

If the requested Amazon Elastic Compute Cloud (Amazon EC2) capacity isn't available in Amazon EMR, then scaling fails after the timeout period. If scaling is stuck for longer than 2 to 3 hours and you receive insufficient capacity errors in AWS CloudTrail events, then perform a manual resize.
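To confirm the capacity errors, you can search recent CloudTrail events with the AWS CLI. The following is a sketch; the event name and error string are assumptions, since insufficient-capacity failures typically surface on the underlying Amazon EC2 RunInstances calls:

```shell
# Search recent RunInstances events for insufficient capacity errors.
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
    --max-results 50 \
    --query "Events[].CloudTrailEvent" \
    --output text | grep -i "InsufficientInstanceCapacity"
```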

Related information

Using automatic scaling with a custom policy for instance groups in Amazon EMR

Manually resize a running Amazon EMR cluster

Using managed scaling in Amazon EMR

Top 9 performance tuning tips for PrestoDB on Amazon EMR
