EMR ON EKS - Libraries missing

Hello,

I have deployed EMR on EKS and it works correctly. I have tested submitting simple jobs following the AWS guide: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-spark-sql-parameters.html

Subsequently, I deployed an Airflow environment from which I run a DAG that triggers a notebook running on the EMR virtual cluster. So far, everything works correctly.

The problem comes when I try to import a library such as pandas: the import fails because the module does not exist. I tried installing the library in the pod so that the process could continue, but according to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-install-libraries-and-kernels.html you cannot install additional libraries.

Note: I have tried different EMR releases, including the most recent one (6.8.0-latest).

I send the logs to CloudWatch, where I can see the following error message:

{"message":" import pandas as pd"}

{"message": "ModuleNotFoundError: No module named 'pandas'"}

The same happens with numpy, but not with Boto3, for example.
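A quick way to confirm which packages the image actually provides is to run a small check at the top of the notebook or job. This is only a minimal sketch: `check_modules` is a hypothetical helper name, and the package list is just the three libraries mentioned above.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool}, True when the module can be located.

    Uses importlib.util.find_spec, which looks the module up without
    importing it, so a missing package does not raise here.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The names below are the libraries mentioned in this question.
    for name, present in check_modules(["pandas", "numpy", "boto3"]).items():
        print(f"{name}: {'available' if present else 'MISSING'}")
```

The output ends up in the same CloudWatch log stream as the job, which makes it easy to compare releases or images.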

Thank you in advance for your time

Asked 2 years ago, 276 views
1 Answer
Hello,

One option is to use a custom image. You can package your dependencies into your own image, built on top of a base image provided by the service, and then reference that image when you submit jobs to your virtual cluster. This blog post demonstrates how to do that: https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/
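As a rough illustration, once a custom image containing pandas and numpy has been pushed to a registry, a job can point at it through the Spark property `spark.kubernetes.container.image`. The sketch below uses the Boto3 `emr-containers` client. The helper names `build_job_run_request` and `submit`, the job name, the default region, and all argument values are assumptions made for the example, not values from this thread. The release label should correspond to the base image the custom image was built from.

```python
def build_job_run_request(virtual_cluster_id, execution_role_arn, image_uri,
                          entry_point, release_label="emr-6.8.0-latest"):
    """Build the keyword arguments for emr-containers StartJobRun.

    All argument values are placeholders supplied by the caller; image_uri
    is assumed to be a custom image that already has the libraries installed.
    """
    spark_params = " ".join([
        # Tell Spark on Kubernetes to run driver and executors from the custom image.
        f"--conf spark.kubernetes.container.image={image_uri}",
        "--conf spark.executor.instances=2",
    ])
    return {
        "name": "custom-image-job",  # hypothetical job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


def submit(request, region="us-east-1"):
    """Submit the job run and return its id (requires AWS credentials)."""
    import boto3  # imported lazily so the request can be built without boto3 installed

    client = boto3.client("emr-containers", region_name=region)
    return client.start_job_run(**request)["id"]
```

Building the request separately from submitting it makes it easy to print and review the parameters before anything is sent to AWS.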

AWS
Answered 1 year ago
