Adding external Python libraries to EMR Serverless: libraries not recognized


Hi everyone.

I'm unable to add extra libraries to EMR Serverless. As an example, I'm simply trying to add requests, but the job run fails with this error: ModuleNotFoundError: No module named 'requests'. Please refer to user guide on how to use python libraries with EMR Serverless.

The Python code is just:

import sys
import os
import requests

print(sys.executable)
print(sys.version)

My job config:

--conf spark.archives=s3://XXXX/emrlib.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python

My EMR application is emr-6.9.0 with arm64.

Notes:
- I see that emr-6.9.0 uses Python 3.7.16, and I generated the environment (tar.gz file) using Python 3.10 (the current and only Python available through yum on EC2).
- Inside the tar file I see all the modules I added, such as requests.

Could anybody give me some advice?

asked a year ago, 1664 views
4 Answers

Hi Krishnadas

Here are the steps I used to create the environment on Linux:

[ec2-user@ip-172-31-81-196 ~]$ python3 -m venv emrlib
[ec2-user@ip-172-31-81-196 ~]$ source emrlib/bin/activate
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ pip3 install upgrade pip
ERROR: Could not find a version that satisfies the requirement upgrade (from versions: none)
ERROR: No matching distribution found for upgrade
WARNING: You are using pip version 21.3.1; however, version 23.1 is available.
You should consider upgrading via the '/home/ec2-user/emrlib/bin/python3 -m pip install --upgrade pip' command.
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ pip3 install requests
Collecting requests
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.9 MB/s
Collecting idna<4,>=2.5
  Downloading idna-3.4-py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 192 kB/s
Collecting certifi>=2017.4.17
  Downloading certifi-2022.12.7-py3-none-any.whl (155 kB)
     |████████████████████████████████| 155 kB 45.2 MB/s
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (199 kB)
     |████████████████████████████████| 199 kB 51.2 MB/s
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
     |████████████████████████████████| 140 kB 66.5 MB/s
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests
Successfully installed certifi-2022.12.7 charset-normalizer-3.1.0 idna-3.4 requests-2.28.2 urllib3-1.26.15
WARNING: You are using pip version 21.3.1; however, version 23.1 is available.
You should consider upgrading via the '/home/ec2-user/emrlib/bin/python3 -m pip install --upgrade pip' command.
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ pip3 install --upgrade pip
Requirement already satisfied: pip in ./emrlib/lib/python3.9/site-packages (21.3.1)
Collecting pip
  Downloading pip-23.1-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 5.8 MB/s
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.3.1
    Uninstalling pip-21.3.1:
      Successfully uninstalled pip-21.3.1
Successfully installed pip-23.1
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ pip3 install requests
Requirement already satisfied: requests in ./emrlib/lib/python3.9/site-packages (2.28.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./emrlib/lib/python3.9/site-packages (from requests) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in ./emrlib/lib/python3.9/site-packages (from requests) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./emrlib/lib/python3.9/site-packages (from requests) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in ./emrlib/lib/python3.9/site-packages (from requests) (2022.12.7)
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ pip3 install venv-pack
Collecting venv-pack
  Downloading venv_pack-0.2.0-py2.py3-none-any.whl (16 kB)
Installing collected packages: venv-pack
Successfully installed venv-pack-0.2.0
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ venv-pack -f -o emrlib.tar.gz
Collecting packages...
Packing environment at '/home/ec2-user/emrlib' to 'emrlib.tar.gz'
[########################################] | 100% Completed | 0.8s
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ ls
Python-3.7.16 Python-3.7.16.tar.xz emrlib emrlib.tar.gz
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ aws s3 cp emrlib.tar.gz s3://ermworkspace/
upload failed: ./emrlib.tar.gz to s3://ermworkspace/emrlib.tar.gz Unable to locate credentials
(emrlib) [ec2-user@ip-172-31-81-196 ~]$ aws s3 cp emrlib.tar.gz s3://ermworkspace/
upload: ./emrlib.tar.gz to s3://ermworkspace/emrlib.tar.gz


Next, I copied the tar file to the S3 bucket used by EMR and set these Spark properties:

--conf spark.archives=s3://erm-dlk-workspace-management/emrlib.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python

**Note

Error log:

Unpacking an archive s3://erm-dlk-workspace-management/emrlib.tar.gz#environment from /tmp/spark-230814cb-00a0-458b-ba59-1ada05b4828d/emrlib.tar.gz to /home/hadoop/./environment
23/04/15 17:29:42 INFO ShutdownHookManager: Shutdown hook called
23/04/15 17:29:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-230814cb-00a0-458b-ba59-1ada05b4828d

Result: ModuleNotFoundError: No module named 'requests'. Please refer to user guide on how to use python libraries with EMR Serverless.

Do you need any additional data?

answered a year ago

Hello,

Thank you for raising this question on re:Post. I understand that you are getting ModuleNotFoundError when you run your example.

To reproduce the issue, I followed the steps described in the doc for Using Python libraries. In addition to the libraries mentioned in the doc, I also installed the requests package so that I could run the same test you shared above.

import sys
import os
import requests

print(sys.executable)
print(sys.version)
print(requests.__version__)
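
Because `import requests` sits at the top of this script, a missing module stops the run before any `print` executes, so the logs never show which interpreter the job actually used. A minimal sketch of a more defensive variant follows. The `describe_runtime` helper is hypothetical (it is not from the EMR Serverless docs), and it only reports what the driver process sees, not the executors.

```python
import importlib.util
import platform
import sys


def describe_runtime(module_name="requests"):
    """Collect interpreter details without importing the module itself.

    module_name is expected to be a top-level module name.
    """
    spec = importlib.util.find_spec(module_name)  # None if the module is not found
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "machine": platform.machine(),  # CPU architecture string reported by the OS
        "module_found": spec is not None,
        "module_origin": getattr(spec, "origin", None),
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

If `executable` is not the interpreter under `./environment/bin/`, or `sys_path` does not include the archive's `site-packages` directory, that would suggest the packed environment is not the one being used.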

I used a separate folder and followed the steps to create the custom .tar.gz package from that directory, so that I could deactivate the venv and clean up easily. I recommend running a test with a new folder and a new venv if you have not done so already.

Once I had prepared and uploaded the *.tar.gz file to my S3 bucket, I used the command below to run the same script in my EMR Serverless application.

aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn job-role-arn \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<bucket>/<prefix>/pyexample.py",
            "entryPointArguments": [],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.archives=s3://<bucket>/<prefix>/pyspark_venv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'
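
The same submission can also be assembled from Python, which avoids quoting mistakes in the long `sparkSubmitParameters` string. The sketch below is not taken from the docs: the helper names are hypothetical, and `start_job_run` assumes boto3 is installed with credentials and a region configured. All three interpreter settings point to the same path inside the unpacked archive, so they are derived from a single alias.

```python
def build_spark_submit_parameters(archive_uri, alias="environment", extra_conf=None):
    """Build the --conf string; the alias is the name after '#' in spark.archives."""
    python_path = f"./{alias}/bin/python"
    conf = {
        "spark.archives": f"{archive_uri}#{alias}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_path,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }
    conf.update(extra_conf or {})
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def build_job_driver(entry_point, archive_uri, extra_conf=None):
    """Return the jobDriver structure used in the CLI command above."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "entryPointArguments": [],
            "sparkSubmitParameters": build_spark_submit_parameters(
                archive_uri, extra_conf=extra_conf
            ),
        }
    }


def start_job_run(application_id, execution_role_arn, job_driver):
    """Submit the job; assumes boto3 and AWS credentials are available."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("emr-serverless")
    return client.start_job_run(
        applicationId=application_id,
        executionRoleArn=execution_role_arn,
        jobDriver=job_driver,
    )
```

Printing the result of `build_job_driver(...)` before submitting makes it easy to confirm that the alias after `#` matches the directory used in the three interpreter paths.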

The JobRun succeeded in my test.

To make sure the issue is not related to a different Python version, I also tested with a different Python version, as described in the doc for using a custom Python version. I followed the steps to prepare the package and ran the same test as above, but with the new Python package, i.e. spark.archives=s3://<bucket>/<prefix>/pyspark_venv_python_3.9.9.tar.gz#environment

This time also my JobRun succeeded.

Can you please do a quick test following the exact steps in either of the two docs (using Python libraries, or using a custom Python version)? If it works for you, you can retrace the steps you followed earlier to understand what you did differently that caused the issue.

Please reply to this if you have additional questions and I'll be happy to reply to you as soon as I can.

Cheers.

AWS
SUPPORT ENGINEER
answered a year ago

Hi dacort, Krishnadas

After following both sets of instructions (Python libraries and custom Python), I get the same error. I wonder if my tar environment is OK. I also tried installing a custom Python 3.7.16 (the version used by EMR Serverless 6.9) and packing the environment, with no result (same error).

Maybe if you share a tar environment file with me, I could check whether it works in EMR Serverless.

Thanks

answered a year ago

Hello,

Thank you for replying. It is hard to troubleshoot without the specific information. I would highly recommend raising a case with us so that one of our engineers can help you resolve this issue more quickly.

In my tests I tried the default version of Python available in Amazon Linux 2 as well as a 3.10.x version, and in both cases it worked for me. I believe the cause is not the Python version but rather the tar package creation, though I cannot be certain of this.
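
One way to check the tar package locally before uploading it is to list what it contains. The following is only a sketch under a few assumptions: the `inspect_venv_archive` helper is hypothetical, it assumes the usual venv layout (`lib/pythonX.Y/site-packages/` and `bin/python` at the archive root), and it only detects packages installed as a directory. It reports which Python version directories the archive holds (the pip output earlier in this thread shows packages going into `lib/python3.9/site-packages`), whether the package is present, and whether `bin/python` is a symlink to an interpreter path that may not exist on the EMR Serverless image.

```python
import re
import tarfile

# Matches e.g. "lib/python3.9/site-packages/requests/__init__.py"
_SITE_PACKAGES = re.compile(r"^(?:\./)?lib/(python\d+\.\d+)/site-packages/([^/]+)")


def inspect_venv_archive(path, package="requests"):
    """Summarise a packed venv: Python dirs, package presence, bin/python link."""
    python_dirs = set()
    package_found = False
    python_link = None
    with tarfile.open(path) as tar:  # compression is detected automatically
        for member in tar:
            name = member.name
            match = _SITE_PACKAGES.match(name)
            if match:
                python_dirs.add(match.group(1))
                if match.group(2) == package:
                    package_found = True
            if name in ("bin/python", "./bin/python") and member.issym():
                python_link = member.linkname  # symlink target on the build host
    return {
        "python_dirs": sorted(python_dirs),
        "package_found": package_found,
        "bin_python_symlink": python_link,
    }
```

For example, `inspect_venv_archive("emrlib.tar.gz")` on the build machine would show whether the `site-packages` directory matches the Python version that actually runs in the job, which may be worth ruling out even though I cannot say it is the cause here.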

AWS
SUPPORT ENGINEER
answered a year ago
