Is it possible to dynamically set the up-front capacity for an EMR cluster using Data Pipeline?


Main problem

I understand that there is no need to add Auto Scaling to an EMR cluster launched by Data Pipeline; instead, we can specify the capacity up-front and it will be used for the duration of the job. But what if I am transforming some data on a weekly basis, the volume of data changes every week, and I can't be sure up-front how many nodes the cluster needs for good performance?

Possible solution?

At the moment, I can predict the amount of data EMR will have to process from the number of events tracked over a period of time in OpenSearch (where EMR extracts its data). E.g., if 1 EMR node can handle 1,000 events and the actual number of events is 10,000, then create 10 nodes.
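The sizing rule described above is just a ratio with rounding. A minimal sketch, where the 1,000-events-per-node ratio comes from the example and the min/max clamp is an added assumption to keep the cluster size sane:

```python
import math

# Assumed capacity ratio from the example: events one node can handle per run.
EVENTS_PER_NODE = 1_000

def nodes_for(event_count: int, min_nodes: int = 1, max_nodes: int = 20) -> int:
    """Round up to whole nodes, then clamp to an assumed [min, max] range."""
    needed = math.ceil(event_count / EVENTS_PER_NODE)
    return max(min_nodes, min(needed, max_nodes))
```

So `nodes_for(10_000)` gives 10, matching the example, while very small or very large event counts stay within the clamp.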

I thought about creating an EventBridge cron rule that executes a Lambda function ~10 minutes before the Data Pipeline run, calculates the number of nodes, and stores the value in a service like SSM Parameter Store. Then, when the Data Pipeline starts, I can retrieve the value and pass it as a parameter for the task.
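The Lambda side of that flow could look roughly like the sketch below. The parameter name is hypothetical, the events-per-node ratio is the one from the example, and the OpenSearch query is stubbed out (the event count is read from the invocation payload instead):

```python
import json
import math

EVENTS_PER_NODE = 1_000  # assumed ratio: events one node can handle per run
PARAM_NAME = "/pipeline/emr/core-node-count"  # hypothetical parameter name

def compute_nodes(event_count: int) -> int:
    """At least one node; round up so no events are left uncovered."""
    return max(1, math.ceil(event_count / EVENTS_PER_NODE))

def handler(event, context):
    # A real function would query OpenSearch for the event count here;
    # this sketch reads it from the invocation payload instead.
    nodes = compute_nodes(event["event_count"])

    import boto3  # imported lazily; provided by the Lambda runtime
    boto3.client("ssm").put_parameter(
        Name=PARAM_NAME, Value=str(nodes), Type="String", Overwrite=True
    )
    return {"statusCode": 200, "body": json.dumps({"nodes": nodes})}
```

The pipeline definition would then read the same parameter at start time and feed it into the EMR cluster's core instance count.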

This may sound a little complicated, so I would like to know if there's an easier way to achieve this. Thanks in advance!
