
Multiple Spark Submits in Parallel


EMR 6.4, instance type: r5d.4xlarge, 1 master node and 1 core node. I am trying to submit multiple Spark jobs in parallel, with equal resources allocated to each, but I am not able to submit more than 4 parallel jobs per node on the instance type above. Can one node only pick up 4 jobs in parallel with equal resourcing, or do my Spark configs need to be changed?

Source: 10 tables with fewer than 100 records each, no complex data types, and a one-to-one load.

Spark configs given below:

"spark.executor.cores": 2
"spark.executor.instances": 2
"spark.executor.memory": "5G"
"spark.sql.shuffle.partitions": "5"
"spark.default.parallelism": "5"
"spark.dynamicAllocation.enabled": "false"
"spark.scheduler.mode": "FAIR"
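For reference, these settings correspond to a spark-submit invocation along the lines of the sketch below. The class name and jar path are placeholders, not taken from the question, and "--deploy-mode cluster" reflects the follow-up comment that cluster mode is used:

```shell
# Placeholder class name and jar path; substitute your own job artifact.
spark-submit \
  --deploy-mode cluster \
  --conf spark.executor.cores=2 \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=5G \
  --conf spark.sql.shuffle.partitions=5 \
  --conf spark.default.parallelism=5 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.scheduler.mode=FAIR \
  --class com.example.TableLoad \
  s3://my-bucket/jobs/table-load.jar
```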

asked 7 months ago · 335 views
2 Answers

I understand that you are trying to run Spark applications in parallel and expected to be able to run more of them, but you are finding that only 4 run at a time, so you want to know whether there are any restrictions or limits based on the instance type.

There is no restriction as such on how many applications can run in parallel; the limit is determined by the resources available on the cluster.

EMR configures the memory and cores for YARN on each node according to the instance type. For r5d.4xlarge, EMR configures YARN to allocate containers up to a total of 122880 MB (120 GB) of memory per node. You can refer to the EMR documentation to check the YARN configuration for each instance type.

Keep in mind that the master node does not run any executors; it runs only the driver, and only if you have not specified "--deploy-mode cluster" in your "spark-submit" command (the default is "--deploy-mode client"). Also, since there is only one core node, the executors and the ApplicationMaster (and the driver, with "--deploy-mode cluster") for every application all run on that single core node.
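As a back-of-envelope check (the AM/driver container size and the overhead formula below are assumptions, not values from the thread; YARN typically adds max(384 MB, 10%) of overhead per executor container):

```shell
# All sizes in MB. Overhead and AM size are assumed values for illustration.
YARN_NODE_MB=122880     # YARN memory per r5d.4xlarge node (from EMR defaults)
EXEC_MEM_MB=5120        # spark.executor.memory = 5G
EXEC_OVERHEAD_MB=512    # assumed: 10% of 5120, which exceeds the 384 MB floor
AM_MB=2560              # assumed AM + driver container size in cluster mode

# Each app asks for 2 executor containers plus one AM container.
PER_APP_MB=$(( 2 * (EXEC_MEM_MB + EXEC_OVERHEAD_MB) + AM_MB ))
MAX_APPS=$(( YARN_NODE_MB / PER_APP_MB ))
echo "per-app footprint approx ${PER_APP_MB} MB; memory alone fits approx ${MAX_APPS} apps"
```

Memory alone would fit roughly 8 such applications on one node under these assumptions; the number actually observed also depends on vcore accounting, minimum allocation rounding, and scheduler settings.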

To understand what was happening in your example, I ran 20 applications (the SparkPi example with the same config you shared) in a loop, submitted as background processes (almost in parallel), and I could see 17 of them running at a time. Not all of the applications got their 2 executors, because the first few applications in the race to secure resources had already consumed the vcores and memory available on the core node. This demonstrates that you are limited only by the resources on your cluster: once you increase the resources, you will be able to run more applications.
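The submission loop described above can be sketched as follows. The examples jar path is the EMR default location; the iteration count and SparkPi argument are illustrative:

```shell
# Submit 20 SparkPi runs as background processes so they land on YARN
# almost simultaneously, each with the config from the question.
for i in $(seq 1 20); do
  spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.executor.cores=2 \
    --conf spark.executor.memory=5G \
    --conf spark.dynamicAllocation.enabled=false \
    --conf spark.scheduler.mode=FAIR \
    /usr/lib/spark/examples/jars/spark-examples.jar 1000 &
done
wait   # block until all submissions have returned
```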

To test this further, I added 4 more nodes (core or task nodes on EMR 6.x.x) and ran the Spark applications with "--deploy-mode cluster" in a loop as before, but submitted more of them this time. With the additional nodes I was able to see more than 60 Spark applications (the SparkPi example with the config shared here) running in parallel.

However, the ResourceManager, which is responsible for resource negotiation on the cluster, then becomes a single point of failure; depending on the load, it may be stretched and could run into issues as the number of parallel applications grows. I have not tested how many applications it takes to cause issues in the ResourceManager, but I am confident you could easily run 100+ applications in parallel, provided the resources are available on your cluster.

I hope this answers your questions; please do not hesitate to reply if you have more questions on the same.

answered 7 months ago

Thanks Krishnadas_M for the detailed reply. The spark-submit mode is cluster only. With Spark dynamic allocation set to true, 8 Spark jobs run in parallel on 1 master node and 1 core node. However, a job processing a smaller file takes longer to complete than one processing a larger file. For example:

S3 input file size 1.4 KB - run time 115 seconds
S3 input file size 40 KB - run time 70 seconds

When I set Spark dynamic allocation to false, with 2 executor cores, 5 GB of executor memory, and spark.task.cpus set to 2 for each spark-submit, only 4 jobs run in parallel. In this case the smaller files are processed earlier than the larger ones. I understand that increasing the number of nodes will increase the number of parallel jobs. Since I want to process input files of less than 50 KB or 100 KB with no transformations from source to target, I am trying to work out how many nodes are required to process N tables concurrently. Can I go with 4 jobs per node, as per my testing results, or does my configuration need to be enhanced?
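Under the assumption that the observed figure of about 4 static-allocation jobs per r5d.4xlarge node holds, the node-count question reduces to simple division. The job count below is a made-up example, not a number from the thread:

```shell
# Hypothetical sizing: nodes needed for N concurrent jobs at ~4 jobs/node.
N_JOBS=40            # example: 40 tables loaded concurrently (assumption)
JOBS_PER_NODE=4      # observed per-node capacity with static allocation
# Ceiling division: round up so a partial batch still gets a node.
NODES=$(( (N_JOBS + JOBS_PER_NODE - 1) / JOBS_PER_NODE ))
echo "approx ${NODES} core/task nodes for ${N_JOBS} parallel jobs"
```

Whether 4 jobs per node is the right planning figure still depends on the AM/driver container sizes and vcore settings, so it is worth re-validating after any config change.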

answered 7 months ago
