How do I use external Python libraries in my AWS Glue ETL job?
I want to use external Python libraries in an AWS Glue extract, transform, and load (ETL) job.
Short description
When you use AWS Glue versions 2.0, 3.0, and 4.0, you can install additional Python modules or different module versions at the job level. To add a new module or change the version of an existing module, use the --additional-python-modules job parameter key. The key's value is a list of comma-separated Python module names. When you use this parameter, your AWS Glue ETL job installs the additional modules through the Python package installer (pip3).
You can also use the --additional-python-modules parameter to install Python libraries that are written in C-based languages.
Resolution
Install or update Python modules
To install an additional Python module for your AWS Glue job, complete the following steps:
- Open the AWS Glue console.
- In the navigation pane, choose Jobs.
- Select the job where you want to add the Python module.
- Choose Actions, and then choose Edit job.
- Expand the Security configuration, script libraries, and job parameters (optional) section.
- Under Job parameters, do the following:
For Key, enter --additional-python-modules.
For Value, enter a comma-separated list of modules that you want to add.
- Choose Save.
For example, suppose that you want to add two new modules, version 1.0.2 of PyMySQL and version 3.6.2 of the Natural Language Toolkit (NLTK). You install the PyMySQL module from the internet and the NLTK module from an Amazon Simple Storage Service (Amazon S3) bucket. In that case, the --additional-python-modules parameter key has the value pymysql==1.0.2, s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl.
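As a rough illustration of how that value is put together, the following Python sketch joins a list of module specifiers into a single string. The helper name build_modules_value is hypothetical and is not part of AWS Glue or any AWS SDK. It strips surrounding whitespace and joins entries with a bare comma, which is an assumption made here for tidiness.

```python
def build_modules_value(modules):
    """Join module specifiers into one --additional-python-modules value.

    Hypothetical helper for illustration only; not an AWS API.
    """
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [m.strip() for m in modules if m.strip()]
    return ",".join(cleaned)


# The two modules from the example above: one from the internet, one from Amazon S3.
job_parameters = {
    "--additional-python-modules": build_modules_value(
        [
            "pymysql==1.0.2",
            "s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl",
        ]
    )
}

print(job_parameters["--additional-python-modules"])
```

The resulting dictionary mirrors the Key and Value fields that you enter under Job parameters in the console.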
Some modules have dependencies on other modules. If you install or update such a module, then you must also download the other modules that it depends on. This means that you must have internet access to install or update the module. If you don't have internet access, then see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.
For a list of Python modules that are included in each AWS Glue version by default, see Python modules already provided in AWS Glue.
Install C-based Python modules
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
AWS Glue also supports libraries and extensions written in C with the --additional-python-modules parameter. However, some Python modules, such as spacy and grpc, require root permissions to install. AWS Glue doesn't provide root access during package installation. To resolve this issue, precompile the binaries into a wheel compatible with AWS Glue and install that wheel.
To compile a library in a C-based language, the compiler must be compatible with the target operating system and processor architecture. If the library is compiled against a different operating system or processor architecture, then the wheel isn't installed in AWS Glue. Because AWS Glue is a managed service, cluster access isn't available to develop these dependencies.
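A wheel's file name records which Python version, ABI, and platform it was built for, so it can be inspected before you upload it. The following sketch assumes the standard wheel naming convention from PEP 427 (name-version[-build]-python-abi-platform.whl) and applies it to the two wheel file names that appear in this article. The function names are hypothetical, and the sketch doesn't check what a particular AWS Glue version accepts.

```python
def parse_wheel_tags(filename):
    """Split a wheel file name into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform gives 5 or 6 fields.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def is_platform_specific(filename):
    """Return True when the wheel targets a specific OS or architecture."""
    return parse_wheel_tags(filename)["platform"] != "any"


for wheel in (
    "nltk-3.6.2-py3-none-any.whl",
    "grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl",
):
    print(wheel, parse_wheel_tags(wheel), is_platform_specific(wheel))
```

A pure Python wheel such as the NLTK one carries the platform tag any, while a compiled wheel such as the grpcio one names the operating system and processor architecture that it was built against.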
To precompile a C-based Python module that requires root permissions, complete the following steps:
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance with enough volume space for your libraries.
- Install Docker on the EC2 instance, set up non-sudo access, and then start Docker. To do so, run the following commands:
Install Docker:
sudo yum install docker -y
Set up non-sudo access:
sudo usermod -a -G docker ec2-user
Start Docker:
sudo service docker start
- Create a Dockerfile for the module. For example, to install the grpcio module, create a file called dockerfile_grpcio and copy the following content into the file:
# Base for AWS Glue
FROM amazonlinux
RUN yum update -y
RUN yum install shadow-utils.x86_64 -y
RUN yum install -y java-1.8.0-openjdk.x86_64
RUN yum install -y python3
RUN yum install -y cython doxygen numpy scipy gcc autoconf automake libtool zlib-devel openssl-devel maven wget protobuf-compiler cmake make gcc-c++
# Additional components needed for grpcio
WORKDIR /root
RUN yum install python3-devel -y
RUN yum install python-devel -y
RUN pip3 install wheel
# Install grpcio and related modules
RUN pip3 install Cython
RUN pip3 install cmake scikit-build
RUN pip3 install grpcio
# Create a directory for the wheel
RUN mkdir wheel_dir
# Create the wheel
RUN pip3 wheel grpcio -w wheel_dir
- Run the docker build command to build your Dockerfile:
docker build -f dockerfile_grpcio .
- Restart the Docker daemon:
sudo service docker restart
When the docker build command completes, you get a success message that contains your Docker image ID. For example, "Successfully built 1111222233334444". Note the Docker image ID to use in the next step.
- Extract the .whl wheel file from the Docker container. To do so, run the following commands:
Get the Docker image ID:
docker image ls
Run the container, but replace 1111222233334444 with your Docker image ID:
docker run -dit 1111222233334444
Verify the location of the wheel file and retrieve the name of the wheel file, but replace 5555666677778888 with your container ID:
docker exec -t -i 5555666677778888 ls /root/wheel_dir/
Copy the wheel from the Docker container to Amazon EC2:
docker cp 5555666677778888:/root/wheel_dir/doc-example-wheel .
Note: Replace doc-example-wheel with the name of your generated wheel file.
- To upload the wheel to Amazon S3, run the following commands:
aws s3 cp doc-example-wheel s3://path/to/wheel/
aws s3 cp grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl s3://aws-glue-add-modules/grpcio/
Note: Replace grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl with the name of your Python package file.
- Open the AWS Glue console.
- For the AWS Glue ETL job, under Job parameters, enter the following:
For Key, enter --additional-python-modules.
For Value, enter s3://aws-glue-add-modules/grpcio/grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl.
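If the job already has a value for --additional-python-modules, the new wheel has to be appended to the existing list rather than entered in its place. The following is a minimal sketch of that merge on a plain dictionary of job parameters. The add_module helper and the --example-setting key are hypothetical, and the sketch only prepares the value; it doesn't call AWS Glue.

```python
PARAM_KEY = "--additional-python-modules"


def add_module(job_parameters, module):
    """Return a copy of job_parameters with module appended to the modules list.

    Hypothetical helper for illustration only; not an AWS API.
    """
    updated = dict(job_parameters)  # leave the caller's dictionary untouched
    existing = [
        m.strip() for m in updated.get(PARAM_KEY, "").split(",") if m.strip()
    ]
    if module not in existing:  # avoid listing the same module twice
        existing.append(module)
    updated[PARAM_KEY] = ",".join(existing)
    return updated


current = {
    "--additional-python-modules": "pymysql==1.0.2",
    "--example-setting": "unchanged",  # hypothetical unrelated parameter
}
wheel_uri = "s3://aws-glue-add-modules/grpcio/grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl"

print(add_module(current, wheel_uri)[PARAM_KEY])
```

Other job parameters pass through unchanged, and calling the helper again with the same wheel leaves the list as it is.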