Presto autoscaling on EMR on EC2 not working

The AWS CLI create-cluster command below runs successfully, but the cluster events show that EMR is unable to attach the autoscaling policy to the cluster:

aws emr create-cluster \
    --name "emr-creds-presto-autoscale-alarm_v3" \
    --release-label "emr-6.9.0" \
    --service-role "arn:aws:iam::xxxxxxx386:role/EMR-reducer-role" \
    --ec2-attributes '{"InstanceProfile":"EMR-reducer-role","EmrManagedMasterSecurityGroup":"sg-0f53f","EmrManagedSlaveSecurityGroup":"sg-0553f","SubnetId":"subnet-0c20022a"}' \
    --applications Name=Presto \
    --configurations '[{"Classification":"presto-connector-hive","Properties":{"hive.metastore.glue.datacatalog.enabled":"true"}}]' \
    --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Master - 1"},{"InstanceCount":2,"AutoScalingPolicy":{"Constraints":{"MinCapacity":2,"MaxCapacity":20},"Rules":[{"Action":{"SimpleScalingPolicyConfiguration":{"ScalingAdjustment":2,"CoolDown":300,"AdjustmentType":"CHANGE_IN_CAPACITY"}},"Description":"","Trigger":{"CloudWatchAlarmDefinition":{"MetricName":"PrestoAvgQueryTimeIncrease","ComparisonOperator":"GREATER_THAN_OR_EQUAL","Statistic":"AVERAGE","Period":300,"Dimensions":[{"Value":"${emr.clusterId}","Key":"JobFlowId"}],"EvaluationPeriods":3,"Unit":"PERCENT","Namespace":"AWS/ElasticMapReduce","Threshold":5}},"Name":"scaleout-avg-query-time-inc"},{"Action":{"SimpleScalingPolicyConfiguration":{"ScalingAdjustment":2,"CoolDown":300,"AdjustmentType":"CHANGE_IN_CAPACITY"}},"Description":"","Trigger":{"CloudWatchAlarmDefinition":{"MetricName":"PrestoQueryCount","ComparisonOperator":"GREATER_THAN_OR_EQUAL","Statistic":"AVERAGE","Period":300,"Dimensions":[{"Value":"${emr.clusterId}","Key":"JobFlowId"}],"EvaluationPeriods":2,"Unit":"COUNT","Namespace":"AWS/ElasticMapReduce","Threshold":10}},"Name":"scaleout-querycount"},{"Action":{"SimpleScalingPolicyConfiguration":{"ScalingAdjustment":18,"CoolDown":300,"AdjustmentType":"CHANGE_IN_CAPACITY"}},"Description":"","Trigger":{"CloudWatchAlarmDefinition":{"MetricName":"ScaleOutToMax","ComparisonOperator":"GREATER_THAN_OR_EQUAL","Statistic":"AVERAGE","Period":300,"Dimensions":[{"Value":"${emr.clusterId}","Key":"JobFlowId"}],"EvaluationPeriods":1,"Unit":"COUNT","Namespace":"AWS/ElasticMapReduce","Threshold":1}},"Name":"scaleoutto-max"},{"Action":{"SimpleScalingPolicyConfiguration":{"ScalingAdjustment":-2,"CoolDown":300,"AdjustmentType":"CHANGE_IN_CAPACITY"}},"Description":"","Trigger":{"CloudWatchAlarmDefinition":{"MetricName":"PrestoQueryCount","ComparisonOperator":"LESS_THAN_OR_EQUAL","Statistic":"AVERAGE","Period":300,"Dimensions":[{"Value":"${emr.clusterId}","Key":"JobFlowId"}],"EvaluationPeriods":5,"Unit":"COUNT","Namespace":"AWS/ElasticMapReduce","Threshold":1}},"Name":"scaleinquery-count"},{"Action":{"SimpleScalingPolicyConfiguration":{"ScalingAdjustment":-18,"CoolDown":300,"AdjustmentType":"CHANGE_IN_CAPACITY"}},"Description":"","Trigger":{"CloudWatchAlarmDefinition":{"MetricName":"ScaleInToMin","ComparisonOperator":"GREATER_THAN_OR_EQUAL","Statistic":"AVERAGE","Period":300,"Dimensions":[{"Value":"${emr.clusterId}","Key":"JobFlowId"}],"EvaluationPeriods":1,"Unit":"COUNT","Namespace":"AWS/ElasticMapReduce","Threshold":1}},"Name":"scaleintomin"}]},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Name":"Core -2"}]' \
    --scale-down-behavior "TERMINATE_AT_TASK_COMPLETION" \
    --ebs-root-volume-size "15" \
    --auto-termination-policy '{"IdleTimeout":3600}' \
    --os-release-label "2.0.20221210.1" \
    --auto-scaling-role "arn:aws:iam::xxxxxx386:role/EMR-reducer-role" \
    --region "us-east-2"

After I added the necessary permissions and policies to arn:aws:iam::xxxxxx386:role/EMR-reducer-role, the autoscaling policy was attached to the cluster. However, the scaling I expect, that is, new core nodes being added when queries run for a long time, does not happen.

1 Answer

Hello,

The configuration looks valid, which is confirmed by the CLI command succeeding with the autoscaling rules above. However, I do not see any configuration that pushes Presto metrics to CloudWatch. Automatic scaling based on Presto metrics is highly recommended for better query performance and lower cost, but CloudWatch does not collect Presto-specific metrics on its own, so custom code and configuration are required to push them to CloudWatch.

Presto exposes many metrics on the JVM, cluster, nodes, tasks, and connectors through Java Management Extensions (JMX). Presto also provides a REST API to access these JMX properties. You can use many of these metrics to scale the Presto cluster based on your query workloads.
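As a rough illustration of reading these JMX properties over REST, here is a minimal Python sketch. It assumes the coordinator listens on port 8889 (the usual Presto port on EMR), that MBeans are served under /v1/jmx/mbean/<object name>, and that the QueryManager MBean exposes attributes named RunningQueries and QueuedQueries. The MBean and attribute names are assumptions and can differ between Presto versions, and the sample payload is made up for illustration, so please verify against your own cluster.

```python
import json
from urllib.request import urlopen

# Assumption: MBean name as used by PrestoDB; check it on your version.
QUERY_MANAGER_MBEAN = "com.facebook.presto.execution:name=QueryManager"


def mbean_url(host, mbean, port=8889):
    """Build the REST URL for one MBean on the coordinator."""
    return f"http://{host}:{port}/v1/jmx/mbean/{mbean}"


def attributes_by_name(mbean_json):
    """Flatten an MBean document into {attribute name: value}."""
    return {
        attr["name"]: attr.get("value")
        for attr in mbean_json.get("attributes", [])
    }


def fetch_mbean(host, mbean=QUERY_MANAGER_MBEAN, port=8889, timeout=5):
    """Fetch one MBean from a live coordinator (needs network access)."""
    with urlopen(mbean_url(host, mbean, port), timeout=timeout) as resp:
        return attributes_by_name(json.load(resp))


# Illustrative payload only; not captured from a real cluster.
sample = {
    "objectName": QUERY_MANAGER_MBEAN,
    "attributes": [
        {"name": "RunningQueries", "value": 3},
        {"name": "QueuedQueries", "value": 7},
    ],
}
stats = attributes_by_name(sample)
print(stats["RunningQueries"], stats["QueuedQueries"])
```

On the master node you would call fetch_mbean("localhost") instead of using the sample payload.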

To use these metrics, enable the JMX connector as described here: https://prestodb.io/docs/current/connector/jmx.html

I did not see the following configuration in the CLI command above (use the presto-connector-jmx classification for a Presto cluster, or trino-connector-jmx for Trino): [{"Classification":"presto-connector-jmx","Properties":{"connector.name":"jmx"},"Configurations":[]}]

I therefore assume the JMX connector is not enabled and no Presto metrics are reaching CloudWatch, which would explain why the autoscaling rules are never triggered.
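If it helps, here is a small Python sketch that builds the value for --configurations by combining the Hive connector setting from the original command with a JMX connector entry. The helper name is hypothetical, and I am assuming the classification follows the <engine>-connector-jmx naming for both Presto and Trino. Note that the property key is "connector.name" with no trailing space; a stray space inside the key would set a different property.

```python
import json


def jmx_connector_configuration(engine="presto"):
    """Return an EMR configuration entry enabling the JMX connector.

    Assumption: the classification is '<engine>-connector-jmx'.
    """
    if engine not in ("presto", "trino"):
        raise ValueError(f"unsupported engine: {engine}")
    return [
        {
            "Classification": f"{engine}-connector-jmx",
            "Properties": {"connector.name": "jmx"},
            "Configurations": [],
        }
    ]


# The Hive connector entry already present in the original command.
existing = [
    {
        "Classification": "presto-connector-hive",
        "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
    }
]

# Compact JSON suitable for: aws emr create-cluster --configurations '<value>'
cli_argument = json.dumps(
    existing + jmx_connector_configuration("presto"),
    separators=(",", ":"),
)
print(cli_argument)
```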

Moving ahead, you will need a bootstrap action that pushes the following Presto metrics to CloudWatch (and to Ganglia, if it is installed on the EMR cluster):

  1. PrestoNumRunningQueries - number of currently running queries
  2. PrestoNumQueuedQueries - number of queued queries
  3. PrestoNumWorkerNodes - number of Presto worker nodes, this can be used for monitoring the cluster size on CloudWatch graphs
  4. PrestoAvgQueryTime5m - average query time in the past 5 minute interval
  5. PrestoAvgQueryTime5mInc - average query time increase in the past 5 minute interval
  6. PrestoAvgQueuedTime5m - average queued time for queries in the past 5 minute interval
  7. PrestoNumFailedQueries - number of failed queries
  8. PrestoNumAbandonedQueries - number of abandoned queries
  9. PrestoNumCanceledQueries - number of canceled queries
  10. PrestoNumCompletedQueries - number of completed queries
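To give an idea of what such a script might publish, here is a minimal Python sketch that shapes a few of the metrics above into the MetricData structure accepted by the boto3 CloudWatch put_metric_data call. The units, the placeholder cluster ID, and the "Custom/Presto" namespace are assumptions for illustration only. Whatever namespace, metric names, and JobFlowId dimension you publish under must match the CloudWatchAlarmDefinition in your scaling rules exactly, otherwise the alarms will have no data to evaluate.

```python
# Assumed units for a subset of the metrics listed above.
METRIC_UNITS = {
    "PrestoNumRunningQueries": "Count",
    "PrestoNumQueuedQueries": "Count",
    "PrestoAvgQueryTime5m": "Seconds",
    "PrestoAvgQueryTime5mInc": "Percent",
}


def percent_increase(previous, current):
    """Percentage change from the previous interval to the current one."""
    if previous <= 0:
        return 0.0  # no baseline yet, so report no increase
    return (current - previous) / previous * 100.0


def build_metric_data(cluster_id, values):
    """Shape {metric name: value} into CloudWatch MetricData entries."""
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
            "Value": float(value),
            "Unit": METRIC_UNITS.get(name, "None"),
        }
        for name, value in values.items()
    ]


def publish(namespace, metric_data):
    """Send the metrics to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )


# Example: average query time went from 40 s to 50 s between intervals.
values = {
    "PrestoNumRunningQueries": 12,
    "PrestoNumQueuedQueries": 4,
    "PrestoAvgQueryTime5m": 50.0,
    "PrestoAvgQueryTime5mInc": percent_increase(40.0, 50.0),
}
metric_data = build_metric_data("j-XXXXXXXXXXXXX", values)
print(metric_data[3]["Value"])  # 25.0
# publish("Custom/Presto", metric_data)  # hypothetical namespace
```

A bootstrap action would typically install a script like this on the master node and run it on a schedule (for example every minute).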

Below are some sample autoscaling rules that make use of some of these metrics:

Scale Out Policies:

Metric Name	Comparison Operator	Threshold	Unit	Scaling Adjustment	Period	Evaluation Periods
PrestoAvgQueryTime5mInc	GREATER_THAN_OR_EQUAL	25	PERCENT	2	300	3
PrestoNumRunningQueries	GREATER_THAN_OR_EQUAL	10	COUNT	2	300	2
PrestoNumQueuedQueries	GREATER_THAN_OR_EQUAL	5	COUNT	2	300	2

Scale In Policies:

Metric Name	Comparison Operator	Threshold	Unit	Scaling Adjustment	Period	Evaluation Periods
PrestoNumRunningQueries	LESS_THAN_OR_EQUAL	1	COUNT	-2	300	5
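For reference, the sample rows above can be turned into the AutoScalingPolicy JSON used in --instance-groups with a sketch like the one below. The rule names and the "Custom/Presto" namespace are hypothetical; the constraints (2 to 20 instances), the 300 second cooldown, and the JobFlowId dimension are taken from the command in the question.

```python
import json

NAMESPACE = "Custom/Presto"  # hypothetical; must match the published metrics


def scaling_rule(name, metric, operator, threshold, unit, adjustment,
                 period, evaluation_periods,
                 namespace=NAMESPACE, cooldown=300):
    """Build one EMR automatic scaling rule from a table row."""
    return {
        "Name": name,
        "Description": "",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,
                "CoolDown": cooldown,
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "MetricName": metric,
                "Namespace": namespace,
                "ComparisonOperator": operator,
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": unit,
                "Period": period,
                "EvaluationPeriods": evaluation_periods,
                "Dimensions": [
                    {"Key": "JobFlowId", "Value": "${emr.clusterId}"}
                ],
            }
        },
    }


# (name, metric, operator, threshold, unit, adjustment, period, eval periods)
ROWS = [
    ("scaleout-avg-query-time-inc", "PrestoAvgQueryTime5mInc",
     "GREATER_THAN_OR_EQUAL", 25, "PERCENT", 2, 300, 3),
    ("scaleout-running-queries", "PrestoNumRunningQueries",
     "GREATER_THAN_OR_EQUAL", 10, "COUNT", 2, 300, 2),
    ("scaleout-queued-queries", "PrestoNumQueuedQueries",
     "GREATER_THAN_OR_EQUAL", 5, "COUNT", 2, 300, 2),
    ("scalein-running-queries", "PrestoNumRunningQueries",
     "LESS_THAN_OR_EQUAL", 1, "COUNT", -2, 300, 5),
]

policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
    "Rules": [scaling_rule(*row) for row in ROWS],
}
print(json.dumps(policy, indent=2))
```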

I suggest using the combination of the above metrics that best suits your requirements for your autoscaling policy. Please also test these policies thoroughly in a test environment before moving to production.

*** Please note that there is currently a limit of 5 rules per custom automatic scaling policy in EMR and it's not possible to increase the limit at this time. ***
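A quick pre-flight check before submitting the policy can catch this limit early. The following Python sketch is only an illustration; the function name is hypothetical, and the capacity check is an extra sanity test I added, not something taken from the EMR documentation.

```python
MAX_RULES = 5  # limit per custom automatic scaling policy, as noted above


def policy_problems(policy, max_rules=MAX_RULES):
    """Return a list of human-readable problems found in a scaling policy."""
    problems = []
    rules = policy.get("Rules", [])
    if len(rules) > max_rules:
        problems.append(
            f"{len(rules)} rules defined, but at most {max_rules} are allowed"
        )
    constraints = policy.get("Constraints", {})
    if constraints.get("MinCapacity", 0) > constraints.get("MaxCapacity", 0):
        problems.append("MinCapacity is greater than MaxCapacity")
    return problems


# Six placeholder rules exceed the limit; the constraints are fine.
too_many = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
    "Rules": [{"Name": f"rule-{i}"} for i in range(6)],
}
for problem in policy_problems(too_many):
    print(problem)
```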

Also please check out this article [1], it is a very useful guide on performance tuning tips for Presto on EMR.

References: [1] https://aws.amazon.com/blogs/big-data/top-9-performance-tuning-tips-for-prestodb-on-amazon-emr/

AWS
Omar_E
answered 10 months ago
  • Hi Omar, I'm installing a Trino cluster on EMR and my "custom automatic scaling" option doesn't work either. My scaling rules are based on MemoryAvailableMB, which is available among the EMR monitoring options. Do I just need to add [{"classification":"trino-connector-jmx", "properties":{"connector.name":"jmx"}, "configurations":[]}] to the configuration for it to work correctly? Also, I have another issue: if I set the "query.max-memory-per-node" option in "trino-config", Trino hits an error and shuts down immediately. Can you help me? Thanks!
