I have managed scaling turned on or the resizing metrics are met, but my EMR cluster doesn’t scale.
Resolution
The Amazon CloudWatch metric thresholds set for scaling aren't met
Automatic scaling depends on Amazon CloudWatch metrics. If the corresponding metrics thresholds aren't met for scaling up or down, then scaling doesn't happen.
Check the Amazon EMR metrics in Amazon CloudWatch to verify that the metrics set in your scaling rules, such as ContainerPendingRatio and YARNMemoryAvailablePercentage, populate.
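To confirm that a metric populates, you can query it with the AWS CLI. The following is a sketch, not a definitive check: it assumes the AWS CLI is configured for the cluster's Region, and the cluster ID is a hypothetical placeholder.

```shell
# Hypothetical cluster ID; replace with your own cluster's ID
CLUSTER_ID="j-XXXXXXXXXXXXX"

# Pull the last hour of YARNMemoryAvailablePercentage data points.
# EMR publishes cluster metrics in the AWS/ElasticMapReduce namespace
# with a JobFlowId dimension.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElasticMapReduce \
  --metric-name YARNMemoryAvailablePercentage \
  --dimensions Name=JobFlowId,Value="$CLUSTER_ID" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```

An empty Datapoints list in the output means the metric isn't populating, so scaling rules that depend on it can't trigger.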
Amazon EMR metrics might not populate as expected in CloudWatch for one of the following reasons:
You use applications that aren't YARN-based
Automatic scaling relies on metrics that YARN generates. Applications such as Presto aren't YARN-based, so clusters don't scale even when Presto query utilization is high. If you use applications that aren't YARN-based, use manual scaling. For example, you can call the Amazon EMR resize API based on custom Presto metrics.
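A manual resize of a task instance group through the AWS CLI might look like the following sketch. The cluster ID, instance group ID, and target count are hypothetical placeholders; substitute your own values.

```shell
# Hypothetical IDs; replace with your cluster and task instance group IDs
CLUSTER_ID="j-XXXXXXXXXXXXX"
TASK_GROUP_ID="ig-XXXXXXXXXXXXX"

# Resize the task instance group to 5 instances
aws emr modify-instance-groups \
  --cluster-id "$CLUSTER_ID" \
  --instance-groups InstanceGroupId="$TASK_GROUP_ID",InstanceCount=5
```

You can trigger a command like this from your own monitoring of Presto metrics, because CloudWatch-based automatic scaling won't see them.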
The core or task instance groups are in a suspended or arrested state
Core or task instance groups in a suspended or arrested state become stuck when they resize or scale. For more information, see Suspended state.
Reconfigurations might cause instance groups to go into an arrested state. For more information, see Troubleshoot instance group reconfiguration.
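You can check each instance group's state with the AWS CLI; SUSPENDED or ARRESTED states indicate a group that's stuck. The cluster ID below is a hypothetical placeholder.

```shell
# Hypothetical cluster ID; replace with your own
CLUSTER_ID="j-XXXXXXXXXXXXX"

# List each instance group's name, type, and current state
# (look for SUSPENDED or ARRESTED)
aws emr list-instance-groups \
  --cluster-id "$CLUSTER_ID" \
  --query 'InstanceGroups[].{Name:Name,Type:InstanceGroupType,State:Status.State}' \
  --output table
```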
HDFS issues in Amazon EMR can cause scaling problems on core nodes
If both of the following conditions are true, then it's a best practice to keep the number of core nodes fixed:
- You store data in Amazon Simple Storage Service (Amazon S3) buckets.
- Hadoop Distributed File System (HDFS) utilization is at a minimum.
Note: It's a best practice to scale task nodes only to avoid HDFS issues.
It takes longer to scale core nodes than task nodes because core nodes run an additional service, the HDFS DataNode, that stores HDFS data. Decommissioning HDFS data takes time. If your use case requires core node scaling and the scaling is stuck, then there might be an issue with HDFS decommissioning.
To troubleshoot scaling that's stuck because of HDFS decommissioning, take the following actions:
- Check the health of the HDFS services (NameNode and DataNode).
- Run the hdfs dfsadmin -report command to check whether there are any missing, corrupted, or under-replicated blocks.
- Check whether there are any core nodes that are unhealthy because of disk, memory, or CPU issues.
- Check the HDFS replication factor (dfs.replication). If the replication factor is 2 or 3 and you try to scale the core nodes down to 1, then the scaling operation becomes stuck. This is because Amazon EMR must maintain the minimum number of block replicas.
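The checks above can be run from a shell on the EMR primary node. This is a sketch of the relevant commands, not an exhaustive diagnostic:

```shell
# 1) Check overall HDFS health; the report lists missing, corrupt,
#    and under-replicated blocks per DataNode
hdfs dfsadmin -report

# 2) Check the configured replication factor
hdfs getconf -confKey dfs.replication
```

For example, if the replication factor is 3, a scale-down target of fewer than 3 core nodes stalls because HDFS can't place the required number of replicas.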
The requested capacity isn't available in Amazon EMR
If the requested Amazon Elastic Compute Cloud (Amazon EC2) capacity isn't available in Amazon EMR, then scaling fails after the timeout period. If scaling is stuck for longer than 2 to 3 hours and you receive insufficient capacity errors in AWS CloudTrail events, then perform a manual resize.
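One way to look for capacity errors is to search recent EC2 RunInstances events in CloudTrail for the InsufficientInstanceCapacity error code. This sketch assumes the AWS CLI is configured in the cluster's Region and that the relevant events fall within the default lookup window:

```shell
# Search recent RunInstances calls and grep the raw event JSON
# for EC2's insufficient-capacity error code
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --max-results 50 \
  --query 'Events[].CloudTrailEvent' \
  --output text | grep -i 'InsufficientInstanceCapacity'
```

If matches appear, the requested instance type isn't available; a manual resize to a different instance type or Availability Zone is a common workaround.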
Related information
Using automatic scaling with a custom policy for instance groups in Amazon EMR
Manually resize a running Amazon EMR cluster
Using managed scaling in Amazon EMR
Top 9 performance tuning tips for PrestoDB on Amazon EMR