Can Data Pipelines be used for running Spark jobs on EMR 6.5.0?


Hi,

I make heavy use of EMR clusters, and I orchestrate them with Data Pipeline: multiple daily runs are automated, and clusters are launched and then terminated on completion.

However, I'd now like to use EMR 6.X.X releases via Data Pipeline, rather than the EMR 5.X.X releases I'm currently using. This is for two main reasons:

  • Security compliance: the latest EMR 6.X.X releases have fewer vulnerabilities than the latest EMR 5.X.X releases
  • Performance/functionality: EMR 6.X.X releases perform much better than EMR 5.X.X releases for what I'm doing, and have functionality I prefer to use

However...the current documentation for Data Pipeline says the following regarding EMR versions:

AWS Data Pipeline only supports release version 6.1.0 (emr-6.1.0).

Version 6.1.0 of EMR was last updated on Oct 15, 2020...it's pretty old.

Now, if I try to use an EMR version greater than 6.1.0 with Data Pipeline, I hit the issue that has already been raised here: during initial cluster bring-up via Data Pipeline, there's a failure that renders the cluster unusable. It looks like a malformed attempt by one of the AWS setup scripts to create a symbolic link to a jar: `find` matches two opencsv jars, so `ln -s` is passed two source files plus a target, and with more than two operands `ln` requires the last argument to be a directory:

++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar
/usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'
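For what it's worth, one idea I've been considering (untested, and the path is assumed from the log above) is a bootstrap action that keeps only one matching opencsv jar, so that the TaskRunner setup script's `find` resolves to a single file. Removing a Hive dependency jar may have side effects, so this is only a sketch:

```shell
#!/bin/bash
# Untested workaround sketch: keep only the newest opencsv jar so the
# TaskRunner setup script's `find` matches exactly one file.
# HIVE_LIB is the path seen in the log on EMR; overridable for local testing.
HIVE_LIB="${HIVE_LIB:-/usr/lib/hive/lib}"
# List matching jars sorted by version; delete all but the last (newest).
for jar in $(find "$HIVE_LIB" -name 'opencsv*jar' 2>/dev/null | sort -V | head -n -1); do
  sudo rm -f "$jar"   # NB: removing a Hive dependency may break other tooling
done
```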

So - I guess my questions are:

  1. Is there a way to work around the above so that Data Pipelines can be used to launch EMR 6.5.0 clusters for Spark jobs?
  2. If there isn't, is there a different way of automating runs of EMR 6.5.0 clusters, other than writing my own script and scheduling that to bring up the EMR cluster and add the required jobs/steps?

Thanks.

asked 2 years ago · 331 views
1 Answer

With this banner in the console, "Please note that Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by 05/12/2023", I would focus development of new workloads on Amazon Managed Workflows for Apache Airflow (MWAA). Using the Airflow EMR operators, you can kick off a DAG that launches an EMR cluster and terminates it once the job is complete, similar to what we used to do in Data Pipeline.
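To make the pattern concrete, the "launch, run one Spark step, auto-terminate" flow boils down to a RunJobFlow request. A minimal sketch of that request body follows; the cluster name, instance types, log bucket, and IAM role names are placeholders to substitute with your own. The same dict can be passed to Airflow's `EmrCreateJobFlowOperator` via `job_flow_overrides`, or sent directly with boto3's EMR client, which also answers the "write my own script" fallback in the question:

```python
# Sketch of a RunJobFlow request for a transient Spark cluster on emr-6.5.0.
# All names, types, and S3 URIs below are illustrative placeholders.

def spark_job_flow(script_s3_uri: str) -> dict:
    """Build a RunJobFlow request that runs one Spark step, then terminates."""
    return {
        "Name": "nightly-spark",                   # placeholder cluster name
        "ReleaseLabel": "emr-6.5.0",
        "Applications": [{"Name": "Spark"}],
        "LogUri": "s3://my-bucket/emr-logs/",      # placeholder bucket
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_s3_uri],
                },
            }
        ],
    }

# To launch for real (not executed here):
#   import boto3
#   boto3.client("emr").run_job_flow(
#       **spark_job_flow("s3://my-bucket/jobs/etl.py"))
```

In an MWAA DAG you would hand this dict to `EmrCreateJobFlowOperator(job_flow_overrides=spark_job_flow(...))`; because `KeepJobFlowAliveWhenNoSteps` is false and the step terminates the cluster on failure, no separate termination task is strictly required.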

AWS
Eman
answered a year ago
