EMR ON EKS - Libraries missing

Hello,

I have deployed EMR on EKS and it works correctly. I have tested submitting simple jobs following the AWS guide: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-spark-sql-parameters.html

Subsequently, I deployed an Airflow environment from which I run a DAG that triggers a notebook on the EMR virtual cluster. So far everything works correctly.

The problem comes when I try to import a library such as pandas: the import fails with an error saying the module does not exist. I tried installing the library in the pod so that I could continue, but according to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-install-libraries-and-kernels.html you cannot install additional libraries.

Note: I have tried different EMR releases, including the most recent one (6.8.0-latest).

I send the logs to CloudWatch, where I can see the following error message:

{"message":" import pandas as pd"}

{"message": "ModuleNotFoundError: No module named 'pandas'"}

This also happens with numpy, but not with Boto3, for example.
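
To narrow down which libraries are absent from the image, a check like the following could be run at the top of the notebook or job. It is only a minimal sketch: the helper name missing_modules is hypothetical, and it tests top-level module names only.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the Python path."""
    # find_spec returns None when a top-level module is not installed,
    # so nothing is actually imported by this check.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(["pandas", "numpy", "boto3"]) inside the job should list the packages that the current image does not provide.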

Thank you in advance for your time

Asked 2 years ago, viewed 276 times
1 answer

Hello,

One option is a custom image. You can package your dependencies (for example pandas and numpy) in your own image, built on a base image provided by the service, and then run your workloads on the virtual cluster with that image. This blog post demonstrates the approach: https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/
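
As a rough illustration of how a job could point at such an image, the sketch below builds a StartJobRun request that sets the Spark property spark.kubernetes.container.image. The function names, the job name, and all argument values are placeholders rather than anything from the question, and the sketch assumes the image has already been built and pushed to a registry the cluster can pull from, using the base image that matches the release label.

```python
def build_job_request(virtual_cluster_id, execution_role_arn, image_uri,
                      entry_point, release_label="emr-6.8.0-latest"):
    """Build keyword arguments for the emr-containers StartJobRun call."""
    # Point both driver and executor pods at the custom image (assumed to
    # already contain pandas, numpy, and any other required libraries).
    spark_params = f"--conf spark.kubernetes.container.image={image_uri}"
    return {
        "name": "custom-image-job",  # hypothetical job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


def submit(request):
    """Submit the request with boto3 (needs AWS credentials and permissions)."""
    import boto3  # imported here so building the request works without boto3

    return boto3.client("emr-containers").start_job_run(**request)
```

Keeping the request builder separate from the submit call makes it easy to inspect the parameters before anything is sent to AWS, for instance from an Airflow task.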

AWS
Answered 1 year ago
