Is it possible to dynamically change the up-front capacity of an EMR cluster using Data Pipeline?


Main problem

I understand that there is no need to add Auto Scaling to an EMR cluster launched by Data Pipeline; instead, we can specify the capacity up-front and it will be used for the duration of the job. But what if I run a transformation on some data on a weekly basis, the capacity needed to do this changes every week, and I can't be sure how many nodes the cluster requires for good performance?

Possible solution?

At the moment, I can predict the amount of data that EMR will process from the number of events tracked over a period of time in OpenSearch (the source EMR extracts data from). E.g., if 1 EMR node can handle 1,000 events and the actual number of events is 10,000, then create 10 nodes.
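That heuristic can be sketched as a small sizing function; the 1,000-events-per-node ratio is the assumption from the example above, and rounding up with a floor of one node keeps the cluster from ever being empty:

```python
import math

# Assumption from the example: one EMR core node per 1,000 events.
EVENTS_PER_NODE = 1_000

def required_nodes(event_count: int, events_per_node: int = EVENTS_PER_NODE) -> int:
    """Round up to whole nodes, and never request fewer than one."""
    return max(1, math.ceil(event_count / events_per_node))
```

So `required_nodes(10_000)` gives 10, and a week with 10,001 events would round up to 11 nodes rather than under-provisioning.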

I thought about creating an EventBridge cron rule that executes a Lambda function ~10 minutes before the Data Pipeline run to calculate the number of nodes and store the value in a service like SSM Parameter Store. Then, when the Data Pipeline starts, I can retrieve the value and pass it as a parameter to the task.
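A minimal sketch of that Lambda, assuming the parameter name and the `fetch_event_count()` OpenSearch query are placeholders you would fill in (the only real AWS call is `ssm.put_parameter`, which exists in boto3 with these arguments):

```python
import math

EVENTS_PER_NODE = 1_000                              # assumption from the example
PARAMETER_NAME = "/emr/weekly-job/core-node-count"   # hypothetical SSM parameter name

def fetch_event_count() -> int:
    """Placeholder: run a _count query against the OpenSearch events index
    for the relevant time window and return the hit count."""
    raise NotImplementedError

def put_parameter_args(nodes: int) -> dict:
    """Build the kwargs for ssm.put_parameter; Overwrite=True lets the
    weekly run replace last week's value."""
    return {
        "Name": PARAMETER_NAME,
        "Value": str(nodes),
        "Type": "String",
        "Overwrite": True,
    }

def lambda_handler(event, context):
    # boto3 is imported lazily here so the sizing logic above can be
    # unit-tested without AWS credentials or the boto3 package.
    import boto3

    count = fetch_event_count()
    nodes = max(1, math.ceil(count / EVENTS_PER_NODE))
    boto3.client("ssm").put_parameter(**put_parameter_args(nodes))
    return {"eventCount": count, "coreNodes": nodes}
```

The pipeline side would then read the parameter back (e.g. `ssm.get_parameter(Name=...)`) and feed it into the EmrCluster object's instance-count field.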

This may sound a little complicated, so I would like to know if there is an easier way to achieve this. Thanks in advance!

No answers
