Hello,
There are a couple of reasons this issue could have occurred:
- Your EMR Serverless application does not have network access to download the packages onto the workers, or the requirements file was not shipped to all the workers.
- There could be a version mismatch (Python environment).
- There is an ongoing bug in PySpark's virtualenv support (spark.pyspark.virtualenv.requirements) that matches your case: https://issues.apache.org/jira/browse/SPARK-13587
Personally, I haven't explored the above method much. I recommend building a Python virtual environment and using either --archives or --py-files when submitting the Spark application:
--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
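For completeness, here is a minimal sketch of how such an environment could be built and shipped. The bucket and prefix are the same placeholders used above, and venv-pack is one common packaging tool; adjust the dependency list to your job:

# Create the virtual environment and install the job's dependencies
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack
pip install pandas   # example dependency; replace with your own

# Package the environment and upload it to S3
venv-pack -f -o pyspark_venv.tar.gz
aws s3 cp pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz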
Please make sure the Python versions match. If you want to use a different Python version than the one packaged in the Amazon EMR release for your EMR Serverless application, build a Python virtual environment with that Python version. The Python version used to create the virtual environment must be the same as the Python version expected by EMR Serverless. If you are using EMR release 6.15.0 or lower, build it on an Amazon Linux 2 AMI.
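If your local Python does not match the release, one way to guarantee a match is to build the environment inside an Amazon Linux 2 container. This is only a sketch, assuming Docker is available locally and that your dependencies install cleanly with pip:

# Build the venv inside Amazon Linux 2 so the interpreter matches EMR Serverless (6.x)
docker run --rm -v "$PWD":/output amazonlinux:2 bash -c "
  yum install -y python3 python3-pip &&
  python3 -m venv /tmp/pyspark_venv &&
  source /tmp/pyspark_venv/bin/activate &&
  pip install venv-pack &&
  # install your job's dependencies here as well, e.g. pip install pandas
  venv-pack -f -o /output/pyspark_venv.tar.gz
"

This produces the same pyspark_venv.tar.gz as above, but with an interpreter layout that matches the Amazon Linux 2 based EMR releases.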
Hello,
I understand that you are importing custom Python modules in your EMR Serverless job and it is failing with errors.
If you are trying to import just a single .py file in addition to your entrypoint .py file, you need to upload both files to S3 and pass the non-entrypoint file to the Spark submit properties using --conf spark.submit.pyFiles=s3://bucket/name/spark_reader.py, as mentioned in link [1].
If you have a more "complex" module structure (such as additional .py files in a directory), you need to zip that up, upload it, and again use pyFiles to point to the zip; see the sketch below. Can you try following the exact steps in the docs [1] or [2] and see if that helps?
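As a rough illustration (the bucket, prefix, and module names here are placeholders), the zip-based flow could look like this:

# Zip the module directory and upload it alongside the entrypoint
zip -r my_module.zip my_module/
aws s3 cp my_module.zip s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/my_module.zip
aws s3 cp entrypoint.py s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/entrypoint.py

# Reference the zip when submitting the job
--conf spark.submit.pyFiles=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/my_module.zip

The entrypoint can then import the module as usual (import my_module).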
References:
[1] https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
[2] https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html
[3] https://repost.aws/questions/QUtba0RiEGSfyNa8DSWMDE3Q/adding-external-python-libraries-emr-serverless-not-recognized
If you still need help, you can raise a case with us so that one of our engineers can assist you in resolving this issue more quickly.
Hello Ramya, thank you for the detailed response. I would really appreciate help from one of your engineers, since I'm fairly new to EMR and a helping hand is greatly needed. How do I raise a case?
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#describing-your-problem - You can open a support case by following these steps. Please make sure to open the case in the same account where you created the application.
Thank you for the insightful comment; I'll follow your instructions. Since I'm using EMR 7.0 and it requires Python 3.9, does that mean I need to create the venv using the same version? Will creating the venv on my local machine affect it?
Should it also affect this configuration - "spark.jars": "s3://bi-emr-2024/mysql-connector-j-8.3.0.jar" - since it is related to network access?
- Basically, it downloads the jar to each worker during execution, so EMR Serverless must be configured to access and fetch the files from S3. If your EMR Serverless application is in a private subnet, you have to make sure they are downloadable: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html
How could I be using a mismatch?
-- I meant your venv and the EMR Serverless Python version could be mismatched. If this is not the case, then it's good.
I followed the instructions here https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html but I'm receiving errors, which I will put in the next comment.
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = '/tmp/localPyFiles-f8f87172-2600-4ba4-850a-2d22f91aa629:/usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/jars/spark-core_2.12-3.5.0-amzn-0.jar'
  program name = './environment/bin/python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/home/hadoop/environment/bin/python'
  sys.base_prefix = '/usr/local'
  sys.base_exec_prefix = '/usr/local'
  sys.platlibdir = 'lib'
  sys.executable = '/home/hadoop/environment/bin/python'
  sys.prefix = '/usr/local'
  sys.exec_prefix = '/usr/local'
  sys.path = [
    '/tmp/localPyFiles-f8f87172-2600-4ba4-850a-2d22f91aa629',
    '/usr/lib/spark/python/lib/pyspark.zip',
    '/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip',
    '/usr/lib/spark/jars/spark-core_2.12-3.5.0-amzn-0.jar',
    '/usr/local/lib/python39.zip',
    '/usr/local/lib/python3.9',
    '/usr/local/lib/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'
Current thread 0x00007f669c752740 (most recent call first):
  <no Python frame>