Can Data Pipelines be used for running Spark jobs on EMR 6.5.0?


Hi,

I make heavy use of EMR clusters, and I orchestrate them with Data Pipeline: multiple daily runs are automated, and clusters are launched and then terminated on completion.

However, I'd now like to use EMR 6.X.X releases via Data Pipeline rather than the EMR 5.X.X releases I'm currently using, for two main reasons:

  • Security compliance: the latest EMR 6.X.X releases have fewer vulnerabilities than the latest EMR 5.X.X releases
  • Performance/functionality: EMR 6.X.X releases perform much better than EMR 5.X.X releases for what I'm doing, and have functionality I prefer to use

However...the current documentation for Data Pipeline says the following regarding EMR versions:

AWS Data Pipeline only supports release version 6.1.0 (emr-6.1.0).

Version 6.1.0 of EMR was last updated on Oct 15, 2020...it's pretty old.

Now, if I try to use an EMR version > 6.1.0 with Data Pipeline, I hit the issue that has already been raised here: during initial cluster bring-up via Data Pipeline, there is a failure that renders the cluster unusable. It looks like a malformed attempt by one of the AWS scripts to create a symbolic link to a jar:

++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar
/usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.2.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'
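The root cause is visible in the log: `find` matches two opencsv jars, so `ln -s` is invoked with two source files plus a target, and in that multi-source form `ln` requires the target to be a directory. A minimal reproduction in a scratch directory, with a hedged fix (picking one jar via `head -n1` is my illustration, not what the official AWS script does):

```shell
#!/bin/sh
# Reproduce the symlink failure in a scratch directory (NOT the real
# /mnt/taskRunner path) with two fake opencsv jars, as on EMR > 6.1.0.
tmp=$(mktemp -d)
touch "$tmp/opencsv-2.3.jar" "$tmp/opencsv-3.9.jar"

# Like the bootstrap log above: find matches BOTH jars, so ln receives two
# sources plus a non-directory target and fails with
# "target ... is not a directory".
repro_failed=no
ln -s $(find "$tmp" -name 'opencsv*jar') "$tmp/open-csv.jar" 2>/dev/null \
  || repro_failed=yes
echo "multi-source ln failed: $repro_failed"

# Hedged fix: select a single jar before linking (head -n1 is an
# illustrative choice of mine, not the official script's behavior).
open_csv_jar=$(find "$tmp" -name 'opencsv*jar' | head -n1)
ln -s "$open_csv_jar" "$tmp/open-csv.jar"
fix_ok=no
[ -L "$tmp/open-csv.jar" ] && fix_ok=yes
echo "single-source ln ok: $fix_ok"
rm -rf "$tmp"
```

On an EMR 5.X.X cluster only one opencsv jar is present under /usr/lib/hive/lib/, which would explain why the same script works there.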

So - I guess my questions are:

  1. Is there a way to work around the above so that Data Pipelines can be used to launch EMR 6.5.0 clusters for Spark jobs?
  2. If there isn't, is there a different way of automating runs of EMR 6.5.0 clusters, other than writing my own script and scheduling that to bring up the EMR cluster and add the required jobs/steps?

Thanks.

Asked 2 years ago · 397 views
1 Answer

With this banner in the console, "Please note that Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by 05/12/2023", I would focus development of new workloads on Amazon Managed Workflows for Apache Airflow (MWAA). Using this operator, you can kick off a DAG that launches an EMR cluster and terminates it once the job is complete, similar to what we used to do in Data Pipeline.
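For reference, a minimal sketch of such a DAG using the EMR operators from the apache-airflow-providers-amazon package. Everything below (instance types, IAM role names, the S3 script path, and the schedule) is an illustrative placeholder, not a value from the original post:

```python
# Sketch: create an EMR 6.5.0 cluster, submit one Spark step, wait for it,
# then terminate the cluster. Assumes apache-airflow-providers-amazon is
# installed in the MWAA environment; all names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    "Name": "spark-emr-650-example",          # placeholder cluster name
    "ReleaseLabel": "emr-6.5.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",   # placeholder instance type
                "InstanceCount": 1,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",      # placeholder IAM roles
    "ServiceRole": "EMR_DefaultRole",
}

SPARK_STEPS = [
    {
        "Name": "run-spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/my_job.py"],  # placeholder
        },
    }
]

with DAG(
    dag_id="emr_650_spark",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",               # mirrors the daily Data Pipeline runs
    catchup=False,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster",
        job_flow_id=create_cluster.output,
        trigger_rule="all_done",              # terminate even if the step fails
    )
    create_cluster >> add_step >> wait_for_step >> terminate_cluster
```

The `trigger_rule="all_done"` on the terminate task is the design point that replaces Data Pipeline's terminate-on-conclusion behavior: the cluster is torn down whether the Spark step succeeds or fails.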

AWS
Eman
Answered a year ago
