EMR ON EKS - Libraries missing


Hello,

I have deployed EMR on EKS and it works correctly. I have tested submitting simple jobs following the AWS guide: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-spark-sql-parameters.html

Subsequently, I deployed an Airflow environment, from which I run a DAG that triggers a notebook on the EMR virtual cluster. So far, everything works correctly.

The problem arises when I try to import a library such as pandas: the import fails with an error saying the module does not exist. I tried installing the library in the pod so that I could continue, but according to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-install-libraries-and-kernels.html you cannot install additional libraries.

Note: I have tried different EMR release labels, including the most recent one (6.8.0-latest).

I send the logs to CloudWatch, where I can see the following error message:

{"message":" import pandas as pd"}

{"message": "ModuleNotFoundError: No module named 'pandas'"}

The same happens with numpy, but not with Boto3, for example.
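To narrow down which modules the running image actually ships, a check along these lines can be run in the same notebook. It is only a minimal sketch: the helper name `check_modules` is hypothetical, and the module list simply mirrors the libraries mentioned above.

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it is importable."""
    # find_spec returns None when a top-level module cannot be located,
    # so nothing is actually imported here.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names taken from the question; adjust as needed.
    for name, available in check_modules(["pandas", "numpy", "boto3"]).items():
        print(f"{name}: {'available' if available else 'MISSING'}")
```

Any module reported as missing would have to come from the container image itself, since libraries cannot be installed into the pod afterwards.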

Thank you in advance for your time

asked 2 years ago, 276 views
1 Answer

Hello,

One option is a custom image. You can package your dependencies (for example pandas and numpy) in your own image, using a base image that can be provided by the service, and then use that image with your virtual cluster. This blog post demonstrates how to do that: https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/
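As a rough illustration of the job-submission side, the sketch below builds a `start_job_run` request for the boto3 `emr-containers` client and points Spark at a custom image through `spark.kubernetes.container.image`. Everything specific here is an assumption for the example: the helper names, the job name, the region, and whatever IDs, ARNs, and image URI you pass in are placeholders. It also assumes the custom image was built from the base image that matches the release label. A notebook served through a managed endpoint may need the image configured differently, so treat this as a starting point rather than the exact setup for the DAG described in the question.

```python
def build_job_request(virtual_cluster_id, execution_role_arn, image_uri,
                      entry_point, release_label="emr-6.8.0-latest"):
    """Build the keyword arguments for emr-containers start_job_run.

    All argument values are placeholders supplied by the caller; the
    release label should match the base image the custom image was built from.
    """
    # Tell Spark on Kubernetes to run driver and executors from the custom image.
    spark_params = f"--conf spark.kubernetes.container.image={image_uri}"
    return {
        "name": "custom-image-job",  # hypothetical job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


def submit_job(request, region="us-east-1"):
    """Submit the request; needs boto3 and AWS credentials (region is an assumption)."""
    import boto3  # imported lazily so the request can be built without boto3 installed

    client = boto3.client("emr-containers", region_name=region)
    return client.start_job_run(**request)
```

Building the request separately from submitting it makes it easy to print and review the parameters before anything is sent to AWS.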

AWS
answered a year ago
