EMR on EKS - Libraries missing


Hello,

I have deployed EMR on EKS and it works correctly. I have tested submitting simple jobs following the AWS guide: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-spark-sql-parameters.html

Subsequently, I deployed an Airflow environment from which I run a DAG that triggers a notebook on the EMR virtual cluster. So far everything works correctly.

The problem appears when I try to import a library such as pandas: the import fails with an error saying the module does not exist. I tried installing the library in the pod so that the process could continue, but according to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-install-libraries-and-kernels.html you cannot install additional libraries.

Note: I have tried several EMR release versions, including the most recent one (6.8.0-latest).

I send the logs to CloudWatch, where I can see the following error message:

{"message":" import pandas as pd"}

{"message": "ModuleNotFoundError: No module named 'pandas'"}

This also happens with numpy, but not with Boto3, for example.
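To narrow down which libraries the image actually ships, a small check like the following could be run inside the notebook or job. It is only a sketch: the helper name `missing_modules` is made up for illustration, and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    find_spec() returns None when a top-level module is not installed,
    so nothing is actually imported by this check.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The three libraries mentioned in the question.
    print(missing_modules(["pandas", "numpy", "boto3"]))
```

Any name that appears in the printed list is not available to the Python interpreter that runs the code, which would match the ModuleNotFoundError in the logs.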

Thank you in advance for your time.

Asked 2 years ago, 276 views
1 Answer

Hello,

One option is to use a custom image. You can package your dependencies (for example pandas and numpy) in your own image, built on top of a base image provided by the service. Then you can run your jobs on the virtual cluster with this image. This blog post demonstrates how to do that: https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/
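As a rough sketch of the job-submission side, assuming the custom image has already been built and pushed to a registry such as Amazon ECR, a job run could point Spark at that image through `spark.kubernetes.container.image`. The helper names, the job name, and every ID, ARN, and URI used with them are placeholders and not values from this thread. This covers jobs submitted with `start_job_run`; a notebook launched through a managed endpoint may need the image configured differently.

```python
def build_job_run_request(virtual_cluster_id, execution_role_arn, image_uri,
                          entry_point, release_label="emr-6.8.0-latest"):
    """Build the keyword arguments for the emr-containers StartJobRun call.

    The custom image is passed to Spark via spark.kubernetes.container.image.
    The image should be built from the EMR base image that matches
    release_label (assumption: the same release as in the question).
    """
    spark_params = f"--conf spark.kubernetes.container.image={image_uri}"
    return {
        "name": "custom-image-job",  # hypothetical job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


def submit_job_run(request, region_name):
    """Submit the request with boto3 and return the job run ID."""
    import boto3  # imported here so building the request needs no AWS setup

    client = boto3.client("emr-containers", region_name=region_name)
    response = client.start_job_run(**request)
    return response["id"]
```

Building the request separately from sending it makes it easy to inspect the Spark parameters before anything is submitted.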

AWS
answered a year ago
