Import Custom Python Modules on EMR Serverless through Spark Configuration


Hello everyone,

I created a spark_reader.py module that hosts multiple classes I want to use as a template. I've seen in multiple configurations online that setting "spark.submit.pyFiles" should allow the import, and the module's directory has an __init__.py. However, I still get an error when I try to "import spark_reader". I assumed it would work since Spark is configured to ship the file.

My goal is to also use the EMR Serverless application from SageMaker, which is why I want to solve this dependency management issue. Has anyone solved this by modifying the runtimeConfiguration?

{ "runtimeConfiguration": [ { "classification": "spark-defaults", "configurations": null, "properties": { "spark.pyspark.virtualenv.requirements": "s3://bi-emr-2024/venv/requirements.txt", "spark.submit.pyFiles": "s3://bi-emr-2024/emr-serverless-workspaces/modules/spark_reader.py", "spark.pyspark.virtualenv.requirements.use": "true", "spark.pyspark.virtualenv.type": "native", "spark.pyspark.virtualenv.enabled": "true", "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv", "spark.sql.shuffle.partitions": "100", "spark.pyspark.python": "python", "spark.log.level": "DEBUG", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.jars": "s3://bi-emr-2024/mysql-connector-j-8.3.0.jar" } } ] }

2 Answers

Hello,

There could be a couple of reasons this issue occurred:

  1. Your EMR Serverless application does not have network access to download the packages on the workers, or the requirements file is not shipped to all the workers.
  2. There could be a version mismatch (Python environment).
  3. There is an ongoing bug with virtualenv support (spark.pyspark.virtualenv.requirements) in PySpark that matches your case: https://issues.apache.org/jira/browse/SPARK-13587

Personally, I haven't explored the above method much. I recommend building a Python virtual environment and using either --archives or --py-files when submitting the Spark application:

--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
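
For reference, a rough sketch of how these flags could be passed through sparkSubmitParameters when starting the job; the application ID, role ARN, and entrypoint path below are placeholders, not values from this thread:

aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/entrypoint.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }'

The #environment fragment is the directory name the archive is unpacked into on the driver and executors, which is why the Python paths above start with ./environment.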

Please make sure the Python versions match. If the Python version in your virtual environment differs from the version packaged in the Amazon EMR release used by your Amazon EMR Serverless application, rebuild the virtual environment with the Python version the release expects.

The Python version used to create the virtual environment must be the same as the Python version expected by EMR Serverless. If you are using EMR release 6.15.0 or lower, please build it on an Amazon Linux 2 AMI.
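
As a rough sketch of that build step, assuming EMR 7.x (Python 3.9 on Amazon Linux 2023), Docker available locally, and the venv-pack tool; the package list and output path are only illustrative:

# Build the venv inside an Amazon Linux container so the interpreter matches the EMR image
# (use amazonlinux:2 instead for EMR releases 6.15.0 and lower, as noted above).
docker run --rm -v "$PWD":/output amazonlinux:2023 bash -c '
  dnf install -y python3 python3-pip
  python3 -m venv /tmp/pyspark_venv
  source /tmp/pyspark_venv/bin/activate
  pip install venv-pack
  pip install boto3          # example dependency; replace with your requirements
  venv-pack -f -o /output/pyspark_venv.tar.gz
'

# Upload the packed environment so spark.archives can reference it.
aws s3 cp pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz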

AWS SUPPORT ENGINEER · answered a month ago
  • Thank you for the insightful comment. I'll check your instructions. Since I'm using EMR 7.0 and it requires Python 3.9, does that mean I need to create the venv using the same version? Will creating the venv on my local machine affect it?

    1. Your EMR Serverless application does not have network access to download the packages on the workers, or the requirements file is not shipped to all the workers.
    • Should it also affect this configuration - "spark.jars": "s3://bi-emr-2024/mysql-connector-j-8.3.0.jar" - since it is also related to network access?
    2. There could be a version mismatch (Python environment).
    • How could I have a mismatch? I'm using the notebooks in EMR to test the modules and then transferring them to a .py script.
    3. There is an ongoing bug with virtualenv support (spark.pyspark.virtualenv.requirements) in PySpark that matches your case: https://issues.apache.org/jira/browse/SPARK-13587
    • I see. I'll check on this.
  • Should it also affect this configuration - "spark.jars": "s3://bi-emr-2024/mysql-connector-j-8.3.0.jar" - since it is related to network access? -- Basically, the jar is downloaded to each worker during execution, so EMR Serverless needs to be able to reach S3 and fetch the files. If your EMR Serverless application is in a private subnet, you have to make sure they are downloadable: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html

    How could I have a mismatch? -- I meant that your venv's Python version and the EMR Serverless release's Python version could be mismatched. If that is not the case, then it's good.

  • I followed the instructions here https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html but I'm receiving errors, which I will put in the next comment.

  • Could not find platform independent libraries <prefix>
    Could not find platform dependent libraries <exec_prefix>
    Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
    Python path configuration:
      PYTHONHOME = (not set)
      PYTHONPATH = '/tmp/localPyFiles-f8f87172-2600-4ba4-850a-2d22f91aa629:/usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/jars/spark-core_2.12-3.5.0-amzn-0.jar'
      program name = './environment/bin/python'
      isolated = 0
      environment = 1
      user site = 1
      import site = 1
      sys._base_executable = '/home/hadoop/environment/bin/python'
      sys.base_prefix = '/usr/local'
      sys.base_exec_prefix = '/usr/local'
      sys.platlibdir = 'lib'
      sys.executable = '/home/hadoop/environment/bin/python'
      sys.prefix = '/usr/local'
      sys.exec_prefix = '/usr/local'
      sys.path = [
        '/tmp/localPyFiles-f8f87172-2600-4ba4-850a-2d22f91aa629',
        '/usr/lib/spark/python/lib/pyspark.zip',
        '/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip',
        '/usr/lib/spark/jars/spark-core_2.12-3.5.0-amzn-0.jar',
        '/usr/local/lib/python39.zip',
        '/usr/local/lib/python3.9',
        '/usr/local/lib/lib-dynload',
      ]
    Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
    Python runtime state: core initialized
    ModuleNotFoundError: No module named 'encodings'

    Current thread 0x00007f669c752740 (most recent call first): <no Python frame>


Hello,

I understand that you are importing custom Python modules in your EMR Serverless job and it is failing with errors.

If you are trying to import just a single .py file in addition to your entrypoint .py file, you need to upload both files to S3 and provide the non-entrypoint file to the Spark submit properties using --conf spark.submit.pyFiles=s3://bucket/name/spark_reader.py, as mentioned in link [1].
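
For example, a hedged sketch of what that submission could look like with the file from this thread; the main.py entrypoint name, application ID, and role ARN are placeholders:

# Upload the entrypoint and the extra module to S3.
aws s3 cp main.py s3://bi-emr-2024/emr-serverless-workspaces/main.py
aws s3 cp spark_reader.py s3://bi-emr-2024/emr-serverless-workspaces/modules/spark_reader.py

# spark.submit.pyFiles ships spark_reader.py to the driver and executors,
# so `import spark_reader` resolves inside main.py.
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://bi-emr-2024/emr-serverless-workspaces/main.py",
      "sparkSubmitParameters": "--conf spark.submit.pyFiles=s3://bi-emr-2024/emr-serverless-workspaces/modules/spark_reader.py"
    }
  }'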

If you have a more "complex" module structure (like additional .py files in a directory), you need to zip that up, upload it to S3, and use pyFiles again to point to the zip. Can you try following the exact steps in the docs [1] or [2] and see if that helps?
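
A minimal sketch of the zip approach, using a hypothetical package directory named mylib/:

# Hypothetical layout:
#   mylib/__init__.py
#   mylib/spark_reader.py
#   mylib/transforms.py

# Zip the directory itself (not just its contents) so `import mylib.spark_reader` resolves.
zip -r mylib.zip mylib/
aws s3 cp mylib.zip s3://bi-emr-2024/emr-serverless-workspaces/modules/mylib.zip

# Then point the same property at the zip when submitting the job:
#   --conf spark.submit.pyFiles=s3://bi-emr-2024/emr-serverless-workspaces/modules/mylib.zip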

References:
[1] https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
[2] https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html
[3] https://repost.aws/questions/QUtba0RiEGSfyNa8DSWMDE3Q/adding-external-python-libraries-emr-serverless-not-recognized

If you still need help, you can raise a case with us so that one of our engineers can assist you in resolving this issue more quickly.

AWS SUPPORT ENGINEER · answered a month ago
