How to optimize a batch of Spark Jobs on EMR to reduce overall processing time by 4-5x?


A customer is running a batch of 25 nightly Spark jobs split across 2 EMR clusters processing in parallel. There are no dependencies between these jobs; they can all run in parallel. In total, the jobs fetch about 250GB of data from the source tables. Individual job completion times range from 20 minutes to 4 hours, and the overall batch takes 12-14 hours. They need to cut this down to 2-3 hours.

What are the top 3-5 recommendations they can try in order to achieve this in 1-2 weeks?

The Spark code is straightforward: 1) run Spark SQL to read data over JDBC into DataFrames, 2) transform/join the DataFrames, 3) write the DataFrames to S3 partitions.
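For reference, the three steps above can be sketched as a PySpark job. Everything here is a hypothetical skeleton — the JDBC URL, table name, filter, partition column, and S3 path are placeholders, not taken from the question:

```python
# Minimal sketch of the three-step job described above.
# All connection details, column names, and paths are hypothetical.
def run_job(spark, jdbc_url, table, s3_path):
    # 1) Read data over JDBC into a DataFrame
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", table)
          .load())
    # 2) Transform / join DataFrames (stand-in transformation)
    out = df.where("amount > 0")
    # 3) Write the result to partitioned output on S3
    out.write.mode("overwrite").partitionBy("dt").parquet(s3_path)

# Example submission on a live cluster:
#   spark = SparkSession.builder.appName("nightly-job").getOrCreate()
#   run_job(spark, "jdbc:postgresql://db-host:5432/prod",
#           "orders", "s3://example-bucket/orders/")
```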

AWS
Asked 3 years ago · 440 views
1 Answer
Accepted Answer

Hello. There are many factors at play here. I am listing some below:

a) What is the instance configuration? Is it sufficient, or should it be reconsidered?

b) Is auto scaling turned on?

c) What does the Spark UI say? Which tasks take the most time? Is the time spent in the tasks themselves, or waiting for resources?

d) For the reads over JDBC, how many parallel connections are being used?

e) Are you using dynamic partitions?
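On the JDBC point: without `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions`, Spark reads the whole table through a single connection, which is a common bottleneck for a 250GB pull. A sketch of a parallel read configuration — the endpoint, table, and column names are hypothetical:

```python
# Options for a parallel JDBC read. Without partitionColumn/numPartitions,
# Spark reads the entire table over one connection in one task.
# All endpoint/table/column names below are hypothetical.
jdbc_options = {
    "url": "jdbc:postgresql://db-host:5432/prod",
    "dbtable": "orders",
    "partitionColumn": "order_id",  # numeric/date column with an even spread
    "lowerBound": "1",
    "upperBound": "10000000",
    "numPartitions": "32",          # 32 parallel connections / read tasks
    "fetchsize": "10000",           # rows fetched per round trip
}
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Make sure the source database can actually sustain that many concurrent connections across all 25 jobs.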
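For the S3 write step, dynamic partition overwrite lets a job rewrite only the partitions it actually produced, instead of truncating the whole output path first. The config key is standard Spark; the commented write line is illustrative:

```python
# Overwrite only the partitions present in the incoming DataFrame,
# rather than deleting everything under the output path first.
configs = {
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
}
# spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# df.write.mode("overwrite").partitionBy("dt").parquet("s3://example-bucket/orders/")
```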

This is a high-level checklist that needs to be worked through first.

Most important is the code: are you using repartition/coalesce? Are you calling collect() anywhere? The code itself is usually the main factor behind performance issues. Please feel free to reach out to me if you need any additional information.
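To make the repartition point concrete: one common fix is choosing the write-side partition count from the data volume so each output file lands near a target size (~128 MB is a common choice), avoiding both tiny-file explosions and oversized tasks. The helper below is a back-of-the-envelope sketch, not a Spark API:

```python
import math

def writer_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Rough partition count so each output file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# For the ~250 GB across all jobs in the question:
n = writer_partitions(250 * 1024**3)  # 2000 partitions at ~128 MB each
# df.repartition(n).write.partitionBy("dt").parquet(...)  # spreads the write evenly
# df.coalesce(k)    # only to *reduce* partition count; it cannot increase it
# Avoid df.collect() on large data: it pulls every row onto the driver.
```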

AWS
Sundeep
Answered 3 years ago
