EMR on EKS - Libraries missing

Hello,

I have deployed EMR on EKS and it works correctly. I have tested submitting simple jobs by following the AWS guide: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-spark-sql-parameters.html

Subsequently, I deployed an Airflow environment, from which I run a DAG that triggers a notebook on the EMR virtual cluster. So far, everything works correctly.

The problem comes when I try to import a library such as pandas: the import fails because the module cannot be found. I have tried to install the library inside the pod so that the process can continue, but according to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-install-libraries-and-kernels.html additional libraries cannot be installed.

Note: I have tried different EMR releases, including the most recent one (6.8.0-latest).

I send the logs to CloudWatch, where I can see the following error message:

{"message":" import pandas as pd"}

{"message": "ModuleNotFoundError: No module named 'pandas'"}

This also happens with numpy, but not with boto3, for example.
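For reference, a minimal sketch of a check that can be run at the top of the notebook or job script to list which of these modules the interpreter can actually find. The helper name `check_modules` is hypothetical; it only uses the standard library.

```python
import importlib.util
import sys
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether this interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Printing the executable helps confirm which Python the container uses.
    print("Python executable:", sys.executable)
    for name, found in check_modules(["pandas", "numpy", "boto3"]).items():
        print(f"{name}: {'available' if found else 'MISSING'}")
```

The output ends up in the same CloudWatch log stream as the error, so it shows the state of the image the job really ran on.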

Thank you in advance for your time

Asked 2 years ago · 276 views
1 Answer

Hello,

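One option is to use a custom image. You can package your dependencies (for example pandas and numpy) in your own image, built on top of a base image provided by the service, and push it to a registry such as Amazon ECR. You then reference this image when you submit the job, for example through the Spark property spark.kubernetes.container.image, so the driver and executor pods start with the libraries already installed. This blog demonstrates how you can do that: https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/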
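As a rough illustration of the approach, here is a minimal Python sketch under some assumptions. The helpers `render_dockerfile`, `build_job_request`, and `submit` are hypothetical names, and every image URI, ARN, ID, and S3 path is a placeholder you would replace. The Dockerfile layout (switch to root, install with pip3, switch back to the hadoop user) follows the pattern in the AWS documentation for custom images, and the request uses the boto3 `emr-containers` `start_job_run` call. Whether a notebook triggered from Airflow goes through the same submission path depends on your setup, so treat this as a starting point.

```python
from typing import Dict, List


def render_dockerfile(base_image: str, packages: List[str]) -> str:
    """Return Dockerfile text that adds pip packages on top of an EMR on EKS base image."""
    return "\n".join([
        f"FROM {base_image}",                          # base image URI for your region and release
        "USER root",                                   # installing packages needs root
        "RUN pip3 install " + " ".join(packages),
        "USER hadoop:hadoop",                          # switch back to the non-root runtime user
    ]) + "\n"


def build_job_request(
    virtual_cluster_id: str,
    execution_role_arn: str,
    entry_point: str,
    custom_image: str,
    release_label: str = "emr-6.8.0-latest",
) -> Dict:
    """Build start_job_run arguments that point Spark at the custom image."""
    return {
        "name": "custom-image-job",  # placeholder job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        # Assumption: the custom image was built from the base image of this same release.
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": (
                    f"--conf spark.kubernetes.container.image={custom_image}"
                ),
            }
        },
    }


def submit(request: Dict) -> str:
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("emr-containers")
    return client.start_job_run(**request)["id"]
```

You would write the output of `render_dockerfile` to a Dockerfile, build and push it to your own ECR repository, and then pass that repository URI as `custom_image`.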

AWS
Answered a year ago
