Can Data Pipelines be used for running Spark jobs on EMR 6.5.0?



Here's my situation: I make heavy use of EMR clusters, and I orchestrate them with Data Pipeline - multiple daily runs are automated, with clusters launched and then terminated on completion.

However, I'd now like to use EMR 6.X.X releases via Data Pipeline, rather than the EMR 5.X.X releases I'm currently on. This is for two main reasons:

  • Security compliance: the latest EMR 6.X.X releases have fewer vulnerabilities than the latest EMR 5.X.X releases
  • Performance/functionality: EMR 6.X.X releases perform much better than EMR 5.X.X releases for what I'm doing, and have functionality I prefer to use

However, the current Data Pipeline documentation says the following about EMR versions:

AWS Data Pipeline only supports release version 6.1.0 (emr-6.1.0).

Version 6.1.0 of EMR was last updated on Oct 15 - it's pretty old.

Now, if I try to use an EMR release newer than 6.1.0 with Data Pipeline, I hit the issue that has already been raised here, i.e. during initial EMR cluster bring-up via Data Pipeline, there's a failure that renders the cluster unusable. One of the AWS setup scripts makes a malformed `ln -s` call: `find` matches two opencsv jars, so `ln` receives two source files plus a target, and with multiple sources `ln` requires the target to be an existing directory:

++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar
/usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'

So - I guess my questions are:

  1. Is there a way to work around the above so that Data Pipelines can be used to launch EMR 6.5.0 clusters for Spark jobs?
  2. If there isn't, is there a different way of automating runs of EMR 6.5.0 clusters, other than writing my own script and scheduling that to bring up the EMR cluster and add the required jobs/steps?


asked 2 years ago · 402 views
1 Answer

Given this banner in the console - "Please note that Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by 05/12/2023" - I would focus development of new workloads on Amazon Managed Workflows for Apache Airflow (MWAA). Using Airflow's EMR operator, you can kick off a DAG that launches an EMR cluster and terminates it once the job is complete, similar to what we used to do in Data Pipeline.
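To illustrate, here's a minimal sketch of the job-flow definition such a DAG could hand to Airflow's `EmrCreateJobFlowOperator`. The field names follow the EMR `run_job_flow` API; the cluster name, instance types, step arguments, and S3 path below are placeholder assumptions, not values from the question:

```python
# Sketch of an EMR 6.5.0 job-flow definition for use with Airflow's EMR
# operators (field names follow the boto3 EMR run_job_flow API).
# Roles, instance types, and the S3 script path are placeholders.
JOB_FLOW_OVERRIDES = {
    "Name": "nightly-spark-run",              # placeholder cluster name
    "ReleaseLabel": "emr-6.5.0",              # the release Data Pipeline can't launch
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Auto-terminate once all steps finish, like Data Pipeline did:
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/etl.py"],  # placeholder script path
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Inside the DAG, the dict is passed to the operator, roughly:
#   EmrCreateJobFlowOperator(task_id="create_cluster",
#                            job_flow_overrides=JOB_FLOW_OVERRIDES)
```

Because `KeepJobFlowAliveWhenNoSteps` is `False` and the steps are submitted at creation time, the cluster tears itself down when the Spark job ends, so a separate terminate task is only needed for failure cleanup.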

answered a year ago
