Estimate EMR Small Cluster Capacity


Hi team,

We want to use an EMR cluster to process data with Spark jobs. We receive 30,000 files per day, approximately 2 GB of data, and we expect this volume to grow. We have planned a small cluster of m5.xlarge instances: 2 primary nodes, 3 core nodes (with only 3 core nodes, HDFS stores only one copy of each data block), and 2 task nodes.

Do you think this cluster plan is sound? Or would you recommend a different instance type, or some other cluster configuration?

Thanks

Asked 3 months ago · 290 views
1 Answer
Accepted Answer

Hello,

I understand that you are looking for recommendations on creating an EMR cluster to run your Spark jobs. Please see the suggestions below.

➤ First, please note that an EMR cluster can only be created with one of the following primary node layouts [1]:

  1. A cluster with a single primary node.
  2. A multi-primary cluster with exactly 3 primary nodes.

  The planned layout with 2 primary nodes is therefore not a supported configuration.
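The primary node constraint above can be sketched as a small validation helper. This is a minimal sketch, not an official tool: `build_instance_groups` is a hypothetical helper, the dictionary keys mirror the `InstanceGroups` entries that boto3's `run_job_flow` accepts, and the `ON_DEMAND` market is an assumption that the question does not state.

```python
from typing import Dict, List

# EMR supports a single primary node, or a multi-primary cluster with exactly 3.
ALLOWED_PRIMARY_COUNTS = (1, 3)


def build_instance_groups(primary_count: int, core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> List[Dict]:
    """Return instance group definitions (hypothetical helper for illustration)."""
    if primary_count not in ALLOWED_PRIMARY_COUNTS:
        raise ValueError(
            f"primary_count must be one of {ALLOWED_PRIMARY_COUNTS}, got {primary_count}"
        )
    if core_count < 1:
        raise ValueError("at least one core node is needed to host HDFS")

    def group(name: str, role: str, count: int) -> Dict:
        return {
            "Name": name,
            "InstanceRole": role,       # the API still uses MASTER for primary nodes
            "InstanceType": instance_type,
            "InstanceCount": count,
            "Market": "ON_DEMAND",      # assumption: not stated in the question
        }

    groups = [group("Primary", "MASTER", primary_count),
              group("Core", "CORE", core_count)]
    if task_count > 0:
        groups.append(group("Task", "TASK", task_count))
    return groups


# The planned cluster, adjusted from 2 to 3 primary nodes.
groups = build_instance_groups(primary_count=3, core_count=3, task_count=2)
```

Calling the helper with `primary_count=2`, as in the original plan, raises a `ValueError` instead of producing a layout that EMR would reject.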

➤ Given the overall data size of about 2 GB that you mentioned, the m5.xlarge instance type should be sufficient for this use case. [2]
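As a rough back-of-the-envelope check of the sizing claim above, the figures from the question can be compared with the raw capacity of the planned worker nodes. This is only a sketch under stated assumptions: it treats the 2 GB as the total daily volume, uses the published m5.xlarge size of 4 vCPUs and 16 GiB of memory, and ignores the share that the operating system and YARN reserve.

```python
# Figures from the question (assumption: 2 GB is the total daily volume).
DAILY_BYTES = 2 * 1024**3
FILES_PER_DAY = 30_000

# m5.xlarge size: 4 vCPUs and 16 GiB of memory per instance.
M5_XLARGE_VCPUS = 4
M5_XLARGE_MEM_GIB = 16

# Worker nodes in the plan: 3 core + 2 task (primary nodes run no Spark executors).
WORKER_NODES = 3 + 2

avg_file_kib = DAILY_BYTES / FILES_PER_DAY / 1024
total_mem_gib = WORKER_NODES * M5_XLARGE_MEM_GIB
total_vcpus = WORKER_NODES * M5_XLARGE_VCPUS

print(f"average file size: {avg_file_kib:.1f} KiB")
print(f"raw worker memory: {total_mem_gib} GiB across {total_vcpus} vCPUs")
```

Under these assumptions the daily volume is far smaller than the raw worker memory, while the average file is only tens of KiB. The number of files, rather than the data volume, may therefore be the more useful thing to measure when testing the jobs.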

➤ I would also recommend enabling scaling for your EMR cluster, which resizes the cluster according to your workload. Please refer to the document below for more details on scaling. [3]
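One way to express the scaling recommendation above is an EMR managed scaling policy. The following is a minimal sketch: `build_managed_scaling_policy` is a hypothetical helper, the limits of 3 and 8 instances are illustrative assumptions rather than values from the question, and the dictionary follows the `ManagedScalingPolicy` shape that boto3's `put_managed_scaling_policy` accepts.

```python
from typing import Dict


def build_managed_scaling_policy(min_nodes: int, max_nodes: int,
                                 max_core_nodes: int) -> Dict:
    """Build a managed scaling policy dict (hypothetical helper for illustration).

    The limits count core and task instances; primary nodes are not scaled.
    """
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("require 1 <= min_nodes <= max_nodes")
    if max_core_nodes > max_nodes:
        raise ValueError("max_core_nodes cannot exceed max_nodes")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_nodes,
            "MaximumCapacityUnits": max_nodes,
            # Capping core nodes means growth beyond this comes from task nodes.
            "MaximumCoreCapacityUnits": max_core_nodes,
        }
    }


# Illustrative limits: never fewer than the 3 core nodes, at most 8 workers.
policy = build_managed_scaling_policy(min_nodes=3, max_nodes=8, max_core_nodes=3)
```

With an existing cluster, a policy like this could be applied with `boto3.client("emr").put_managed_scaling_policy(ClusterId=..., ManagedScalingPolicy=policy)`, where the cluster ID is whatever EMR assigned at launch.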

However, please note that without visibility into the computation your jobs perform, it is not possible to determine the optimal instance type for your cluster. I therefore recommend testing your use case in a test environment before deploying it to production.

➤ Additionally, I suggest reading the EMR best practices document below, which can help you determine the right infrastructure for your workloads. [4]

References:

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html#emr-plan-ha-launch-examples

[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html

[3] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-scale-on-demand.html

[4] https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Applications/Spark/best_practices/

[5] https://www.amazonaws.cn/en/elasticmapreduce/faqs/

AWS
Veera_G
Answered 3 months ago
