When I try to import extra modules or packages using the AWS Glue Python shell, I get an "ImportError: No module named" response. For example:
ImportError: No module named pyarrow.compat
Short description
The AWS Glue Python shell uses .egg and .whl files. Python can import directly from a .egg or .whl file. To maintain compatibility, be sure that your local build environment uses the same Python version as the Python shell job. For example, if you build a .egg file with Python 3, use Python 3 for the AWS Glue Python shell job.
Note: Starting June 1, 2022, Python shell jobs support only Python 3. For more information, see AWS Glue version support policy.
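One way to check the local build environment before packaging is a small guard script. The following is a minimal sketch: the function name build_python_matches and its parameters are illustrative assumptions, not part of AWS Glue, and the default of 3 reflects the note above that Python shell jobs support only Python 3.

```python
import sys


def build_python_matches(required_major=3, version_info=None):
    """Return True if the interpreter's major version equals required_major.

    required_major defaults to 3 because Python shell jobs support only
    Python 3. version_info is injectable so the check can be tested.
    """
    info = sys.version_info if version_info is None else version_info
    return info[0] == required_major


if __name__ == "__main__":
    if not build_python_matches():
        raise SystemExit(
            "Build with Python 3 to match the Python shell job; "
            "found Python %d.%d" % sys.version_info[:2]
        )
    print("Local Python %d.%d matches the job's major version" % sys.version_info[:2])
```

Run the script with the same interpreter that you use for setup.py so that the check reflects the actual build environment.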
Resolution
1. Create the setup.py file and add the install_requires parameter to list the modules that you want to import:
from setuptools import setup
setup(
name="redshift_module",
version="0.1",
packages=['redshift_module'],
install_requires=['pyarrow','pandas','numpy','fastparquet']
)
2. Create a folder named redshift_module under the current directory:
$ mkdir redshift_module
Then, install the packages:
$ python setup.py develop
Example output:
running develop
running egg_info
writing requirements to redshift_module.egg-info/requires.txt
writing redshift_module.egg-info/PKG-INFO
writing top-level names to redshift_module.egg-info/top_level.txt
writing dependency_links to redshift_module.egg-info/dependency_links.txt
reading manifest file 'redshift_module.egg-info/SOURCES.txt'
writing manifest file 'redshift_module.egg-info/SOURCES.txt'
running build_ext
Creating /usr/local/lib/python3.6/site-packages/redshift-module.egg-link (link to .)
redshift-module 0.1 is already the active version in easy-install.pth
Using /Users/test/Library/Python/3.6/lib/python/site-packages
Searching for pandas==0.24.2
Best match: pandas 0.24.2
Adding pandas 0.24.2 to easy-install.pth file
Using /usr/local/lib/python3.6/site-packages
Searching for pyarrow==0.12.1
Best match: pyarrow 0.12.1
Adding pyarrow 0.12.1 to easy-install.pth file
Installing plasma_store script to /usr/local/bin
3. Do one of the following:
Create a .egg file:
$ python setup.py bdist_egg
-or- Create a .whl file:
$ python setup.py bdist_wheel
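Because .egg and .whl files are zip archives, you can inspect the build result before you upload it. The following is a minimal sketch that uses only the standard library. The helper name top_level_entries is hypothetical, and the script assumes that the build wrote its output to the dist folder.

```python
import glob
import zipfile


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a .egg or .whl file."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})


if __name__ == "__main__":
    # Inspect whatever bdist_egg or bdist_wheel wrote to the dist folder.
    for path in sorted(glob.glob("dist/*.egg") + glob.glob("dist/*.whl")):
        print(path, top_level_entries(path))
```

If redshift_module doesn't appear in the output, the package folder might not have been picked up by setup.py.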
4. Copy the .egg or .whl file from the dist folder to an Amazon Simple Storage Service (Amazon S3) bucket. For more information, see Providing your own Python library. Example, run from inside the dist folder:
$ aws s3 cp MOA_EDM_cdc_controller_g2-0.2.9-py3-none-any.whl s3://doc-example-bucket/glue-libs/python-shell-jobs/
upload: ./MOA_EDM_cdc_controller_g2-0.2.9-py3-none-any.whl to s3://doc-example-bucket/glue-libs/python-shell-jobs/MOA_EDM_cdc_controller_g2-0.2.9-py3-none-any.whl
Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you're using the most recent AWS CLI version.
5. After you add the file's Amazon S3 path to the job's Python library path and run the job, the module is installed in your Python shell job. To confirm, check the Amazon CloudWatch Logs group for Python shell jobs (/aws-glue/python-jobs/output). Here's an example of a successful output:
Searching for pyarrow
Reading https://pypi.python.org/simple/pyarrow/
Downloading https://files.pythonhosted.org/packages/fe/3b/267c0fdb3dc5ad7989417cfb447fbcbec008bafc1bb26d4f0221c5e4e508/pyarrow-0.12.1-cp27-cp27mu-manylinux1_x86_64.whl#sha256=63170571cccaf0bf01a1d30eacc4d9274bd5c4f448c2b5b1a4ddc125952f4284
Best match: pyarrow 0.12.1
Processing pyarrow-0.12.1-cp27-cp27mu-manylinux1_x86_64.whl
Installing pyarrow-0.12.1-cp27-cp27mu-manylinux1_x86_64.whl to /glue/lib/installation
writing requirements to /glue/lib/installation/pyarrow-0.12.1-py3.6-linux-x86_64.egg/EGG-INFO/requires.txt
Adding pyarrow 0.12.1 to easy-install.pth file
Installing plasma_store script to /glue/lib/installation
Installed /glue/lib/installation/pyarrow-0.12.1-py3.6-linux-x86_64.egg
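To check the imports from inside the job script itself, you can log which modules fail to import. The following is a minimal sketch: the helper name find_missing_modules is hypothetical, and the module list mirrors the install_requires example in setup.py.

```python
import importlib


def find_missing_modules(module_names):
    """Try to import each name and return the ones that raise ImportError."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Same modules as the install_requires list in setup.py.
    wanted = ["pyarrow", "pandas", "numpy", "fastparquet"]
    missing = find_missing_modules(wanted)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All modules imported successfully")
```

The printed message should appear in the same CloudWatch Logs output group as the installation log.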
Related information
How do I use external Python libraries in my AWS Glue 1.0 or 0.9 ETL job?
How do I use external Python libraries in my AWS Glue 2.0 ETL job?