EMR Spark: no module named pandas


Hi,

I'm trying to run a Python job on EMR with some dependencies installed in a virtual environment, packaged as follows:

python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyarrow pandas venv-pack
venv-pack -o pyspark_venv.tar.gz

and the job runs with the following configuration:

{
   "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"./environment/bin/python",
   "spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON":"./environment/bin/python",
   "spark.yarn.dist.archives":"s3://DOC-EXAMPLE-BUCKET/prefix/my_pyspark_venv.tar.gz#environment",
   "spark.submit.deployMode":"cluster"
}
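
For reference, I believe the equivalent spark-submit invocation would look roughly like this (job.py stands in for my script):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
  --archives s3://DOC-EXAMPLE-BUCKET/prefix/my_pyspark_venv.tar.gz#environment \
  job.py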

but when I run the job I get "No module named 'pandas'". Locally the script runs correctly with the same venv, and printing sys.path suggests that Spark is using the system Python instead of the venv.

Any idea which configuration to apply so the job actually uses the venv? Thanks

1 Answer

Accepted Answer

It looks like your EMR Spark job is not able to find the packages installed in your virtual environment. To ensure that Spark uses the Python interpreter from your virtual environment, you can try the following:

  1. Add the following line to your EMR Spark job configuration to ensure that Spark uses the Python binary from your virtual environment:

     "spark.executorEnv.PYTHONHASHSEED":"0"

  2. In your PySpark code, add the following lines to explicitly set the Python environment to use (see also the note after this list):

     import os
     # point PySpark at the interpreter shipped inside the unpacked archive
     os.environ['PYSPARK_PYTHON'] = './environment/bin/python'
     os.environ['PYSPARK_DRIVER_PYTHON'] = './environment/bin/python'

  3. Make sure that the pyspark_venv.tar.gz file is uploaded to your S3 bucket with read permissions.

  4. Verify that the virtual environment is successfully extracted by checking the logs in the yarn/userlogs directory.
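
Note (not part of the original answer): in YARN cluster mode the spark.yarn.appMasterEnv.* settings only set environment variables for the application master/driver, so if the executors still fall back to the system Python, a common pattern is to also point them at the packed interpreter through Spark's spark.executorEnv.* mechanism, roughly like this (as an extra flag on the spark-submit sketch in the question, or the equivalent JSON key):

  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python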
hash
answered a year ago
AWS Support Engineer
reviewed a month ago
  • Thanks for the answer. After further testing, it turned out in the end that the Python version was not compatible (see the quick version check below).
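
As a follow-up illustration (not from the thread): an environment packed with venv-pack does not include the interpreter itself, so it only works if the Python on the EMR nodes is compatible with the Python that built it. A quick way to compare the two, where the host name is a placeholder and hadoop is the default EMR SSH user:

# on the machine where the venv was built and packed
python3 --version

# on the EMR cluster (primary node)
ssh hadoop@<emr-primary-node-dns> 'python3 --version'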
