Hello,
There are a couple of reasons this issue could have occurred:
- Your EMR Serverless application does not have network access to download the packages onto the workers, or the requirements file was not shipped to all the workers.
- There could be a version mismatch (Python environment).
- There is an ongoing bug in PySpark's virtualenv support (spark.pyspark.virtualenv.requirements) that matches your case: https://issues.apache.org/jira/browse/SPARK-13587
Personally, I haven't explored the above method much. I recommend building a Python virtual environment and using either --archives or --py-files when submitting the Spark application:
--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
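For completeness, here is a minimal sketch of how such an environment could be built and shipped. The bucket and prefix are the same placeholders used above, and venv-pack is one common packaging tool; adjust the dependency list to your job:

# Create the virtual environment and install the job's dependencies
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack
pip install pandas   # example dependency; replace with your own

# Package the environment and upload it to S3
venv-pack -f -o pyspark_venv.tar.gz
aws s3 cp pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz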
Please make sure the Python versions match. If you want to use a different Python version than the one packaged in the Amazon EMR release for your EMR Serverless application, build a Python virtual environment with that Python version. The Python version used to create the virtual environment must be the same as the Python version expected by EMR Serverless. If you are using EMR release 6.15.0 or lower, build it on an Amazon Linux 2 AMI.
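If your local Python does not match the release, one way to guarantee a match is to build the environment inside an Amazon Linux 2 container. This is only a sketch, assuming Docker is available locally and that your dependencies install cleanly with pip:

# Build the venv inside Amazon Linux 2 so the interpreter matches EMR Serverless (6.x)
docker run --rm -v "$PWD":/output amazonlinux:2 bash -c "
  yum install -y python3 python3-pip &&
  python3 -m venv /tmp/pyspark_venv &&
  source /tmp/pyspark_venv/bin/activate &&
  pip install venv-pack &&
  # install your job's dependencies here as well, e.g. pip install pandas
  venv-pack -f -o /output/pyspark_venv.tar.gz
"

This produces the same pyspark_venv.tar.gz as above, but with an interpreter layout that matches the Amazon Linux 2 based EMR releases.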
Hello,
I understand that you are importing custom Python modules in your EMR Serverless job and it is failing with errors.
If you are trying to import just a single .py file in addition to your entrypoint .py file, you need to upload both files to S3 and pass the non-entrypoint file to the Spark submit properties using --conf spark.submit.pyFiles=s3://bucket/name/spark_reader.py, as mentioned in link [1].
If you have a more "complex" module structure (such as additional .py files in a directory), you need to zip that up, upload it, and again use pyFiles to point to the zip; see the sketch below. Can you try following the exact steps in the docs [1] or [2] and see if that helps?
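As a rough illustration (the bucket, prefix, and module names here are placeholders), the zip-based flow could look like this:

# Zip the module directory and upload it alongside the entrypoint
zip -r my_module.zip my_module/
aws s3 cp my_module.zip s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/my_module.zip
aws s3 cp entrypoint.py s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/entrypoint.py

# Reference the zip when submitting the job
--conf spark.submit.pyFiles=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/my_module.zip

The entrypoint can then import the module as usual (import my_module).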
References:
[1] https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
[2] https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html
[3] https://repost.aws/questions/QUtba0RiEGSfyNa8DSWMDE3Q/adding-external-python-libraries-emr-serverless-not-recognized
If you still need help, you can raise a case with us so that one of our engineers can assist you in resolving this issue more quickly.
Hello Ramya, thank you for the detailed response. I would really appreciate help from one of your engineers, since I'm fairly new to EMR and a helping hand is greatly needed. How do I raise a case?
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#describing-your-problem - You can open a support case by following these steps. Please make sure to open the case in the same account where you created the application.
Thank you for the insightful comment; I'll follow your instructions. Since I'm using EMR 7.0 and it requires Python 3.9, does that mean I need to create the venv using the same version? Will creating the venv on my local machine affect it?
Should it also affect this configuration - "spark.jars": "s3://bi-emr-2024/mysql-connector-j-8.3.0.jar" - since it is related to network access?
- Basically, it downloads the jar to each worker during execution, so EMR Serverless must be configured to access and fetch the files from S3. If your EMR Serverless application is in a private subnet, you have to make sure they are downloadable: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html
How could I be using a mismatch?
-- I meant your venv and the EMR Serverless Python version could be mismatched. If this is not the case, then it's good.
I followed the instructions here https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python.html but I'm receiving errors, which I will put in the next comment.
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = '/tmp/localPyFiles-f8f87172-2600-4ba4-850a-2d22f91aa629:/usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/jars/spark-core_2.12-3.5.0-amzn-0.jar'
  program name = './environment/bin/python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/home/hadoop/environment/bin/python'
  sys.base_prefix = '/usr/local'
  sys.base_exec_prefix = '/usr/local'
  sys.platlibdir = 'lib'
  sys.executable = '/home/hadoop/environment/bin/python'
  sys.prefix = '/usr/local'
  sys.exec_prefix = '/usr/local'
  sys.path = [
    '/tmp/localPyFiles-f8f87172-2600-4ba4-850a-2d22f91aa629',
    '/usr/lib/spark/python/lib/pyspark.zip',
    '/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip',
    '/usr/lib/spark/jars/spark-core_2.12-3.5.0-amzn-0.jar',
    '/usr/local/lib/python39.zip',
    '/usr/local/lib/python3.9',
    '/usr/local/lib/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'
Current thread 0x00007f669c752740 (most recent call first):
  <no Python frame>