Estimate EMR Small Cluster Capacity


Hi team,

We want to use an EMR cluster to process data with Spark jobs. We have 30,000 files per day, totaling approximately 2 GB, and we expect this volume to grow later. We have planned a small cluster using m5.xlarge instances: 2 primary nodes, 3 core nodes (with 3 node instances, only one block of data is stored in HDFS), and 2 task nodes.

Do you think the planning of this cluster is correct? Or do you recommend another type of instance? Or some other cluster configuration?

Thanks

asked 3 months ago, 279 views
1 Answer
Accepted Answer

Hello,

I understand that you are looking for recommendations on creating an EMR cluster to run your Spark jobs. Please see the suggestions below.

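➤ First, note that an EMR cluster can only be created with one of the following primary node configurations [1]:

  1. An EMR cluster with only a single primary node.
  2. A multi-primary cluster with a fixed number of 3 primary nodes.

The planned configuration of 2 primary nodes is therefore not supported; choose either 1 or 3 primary nodes.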
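As a rough illustration of the primary node constraint above, the following Python sketch builds an instance group list in the shape that boto3's `run_job_flow` accepts under `Instances.InstanceGroups`, and rejects any primary node count other than 1 or 3. The function name `build_instance_groups` and the group names are hypothetical, and the node counts are simply the ones from the question.

```python
from typing import Dict, List

# EMR supports either a single primary node or exactly three (multi-primary).
VALID_PRIMARY_COUNTS = (1, 3)


def build_instance_groups(
    primary: int, core: int, task: int, instance_type: str = "m5.xlarge"
) -> List[Dict]:
    """Return instance group definitions for a planned EMR cluster.

    Hypothetical helper: it only builds and validates the structure,
    it does not call AWS.
    """
    if primary not in VALID_PRIMARY_COUNTS:
        raise ValueError(
            f"EMR needs 1 or 3 primary nodes, got {primary}"
        )

    # EMR's API still uses the role name MASTER for primary nodes.
    planned = [
        ("Primary", "MASTER", primary),
        ("Core", "CORE", core),
        ("Task", "TASK", task),
    ]
    return [
        {
            "Name": name,
            "InstanceRole": role,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }
        for name, role, count in planned
        if count > 0  # skip groups that are not used
    ]


# The planned cluster, adjusted from 2 primary nodes to 3.
groups = build_instance_groups(primary=3, core=3, task=2)
```

Calling `build_instance_groups(primary=2, core=3, task=2)` raises a `ValueError`, which mirrors the restriction described in [1].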

➤ Given the overall data size of about 2 GB per day that you mentioned, the 'm5.xlarge' instance type should be sufficient for this use case [2].
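To put the numbers from the question in perspective, here is a small back-of-the-envelope calculation in Python. It assumes "2 GB" means 2 GiB and uses the published m5.xlarge memory size of 16 GiB; the variable and function names are illustrative only.

```python
# Figures taken from the question.
FILES_PER_DAY = 30_000
DAILY_BYTES = 2 * 1024**3  # assumption: "2 GB" is read as 2 GiB

# m5.xlarge provides 4 vCPUs and 16 GiB of memory.
M5_XLARGE_MEMORY_GIB = 16


def average_file_size_kib(total_bytes: int, file_count: int) -> float:
    """Average size of one input file, in KiB."""
    return total_bytes / file_count / 1024


avg_kib = average_file_size_kib(DAILY_BYTES, FILES_PER_DAY)

# Daily volume as a fraction of one node's memory.
daily_fraction_of_node_memory = (DAILY_BYTES / 1024**3) / M5_XLARGE_MEMORY_GIB

print(f"Average file size: {avg_kib:.1f} KiB")
print(f"Daily data vs. one node's RAM: {daily_fraction_of_node_memory:.1%}")
```

The daily volume is only a fraction of a single node's memory, which supports the view that m5.xlarge is adequate for the data size. The average file works out to roughly 70 KiB, so the number of files may matter more than the total volume; that is worth observing when you test the job.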

➤ I would also recommend enabling scaling for your EMR cluster, which resizes the cluster according to your workload. Please refer to the following document for more details about scaling [3].
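As one possible way to enable scaling, the sketch below assumes EMR managed scaling is the option chosen and builds a policy in the shape expected by boto3's `put_managed_scaling_policy`. The helper names, the cluster ID placeholder, and the limits (minimum 3, maximum 8, at most 3 core nodes) are hypothetical values for illustration, not recommendations.

```python
from typing import Dict


def build_managed_scaling_policy(
    min_units: int, max_units: int, max_core_units: int
) -> Dict:
    """Build a managed scaling policy measured in instances.

    Hypothetical helper: it validates the limits and returns the dict
    that would be passed as ManagedScalingPolicy.
    """
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    if max_core_units > max_units:
        raise ValueError("max_core_units cannot exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Capacity above this limit is added as task nodes.
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }


def apply_policy(cluster_id: str, policy: Dict) -> None:
    """Attach the policy to a running cluster (requires AWS credentials)."""
    import boto3  # imported lazily so building the policy needs no AWS SDK

    emr = boto3.client("emr")
    emr.put_managed_scaling_policy(
        ClusterId=cluster_id, ManagedScalingPolicy=policy
    )


# Illustrative limits only; tune them after testing your workload.
policy = build_managed_scaling_policy(min_units=3, max_units=8, max_core_units=3)
# apply_policy("j-XXXXXXXXXXXXX", policy)  # placeholder cluster ID
```

Keeping the policy construction separate from the AWS call makes it easy to review the limits before applying them to a real cluster.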

However, please note that without visibility into the computation your jobs perform, it is not possible to determine the optimal instance type for your cluster. It is therefore recommended to test your use case in a test environment before deploying it to production.

➤ Additionally, I suggest going through the following document on EMR best practices, which can help you determine the right infrastructure for your workloads [4].

References:

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html#emr-plan-ha-launch-examples

[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html

[3] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-scale-on-demand.html

[4] https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Applications/Spark/best_practices/

[5] https://www.amazonaws.cn/en/elasticmapreduce/faqs/

AWS
Veera_G
answered 3 months ago
