How do I use external Python libraries in my AWS Glue ETL job?

I want to use external Python libraries in an AWS Glue extract, transform, and load (ETL) job.

Short description

When you use AWS Glue versions 2.0, 3.0, and 4.0, you can install additional Python modules or different module versions at the job level. To add a new module or change the version of an existing module, use the --additional-python-modules job parameter key. The key's value is a list of comma-separated Python module names. When you use this parameter, your AWS Glue ETL job installs the additional modules through the Python package installer (pip3).

You can also use the --additional-python-modules parameter to install Python libraries that are written in C-based languages.

Resolution

Install or update Python modules

To install an additional Python module for your AWS Glue job, complete the following steps:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job where you want to add the Python module.
  4. Choose Actions, and then choose Edit job.
  5. Expand the Security configuration, script libraries, and job parameters (optional) section.
  6. Under Job parameters, do the following:
    For Key, enter --additional-python-modules.
    For Value, enter a comma-separated list of modules that you want to add.
  7. Choose Save.

For example, suppose that you want to add two new modules, version 1.0.2 of PyMySQL and version 3.6.2 of the Natural Language Toolkit (NLTK). You install the PyMySQL module from the internet and the NLTK module from an Amazon Simple Storage Service (Amazon S3) bucket. In that case, the --additional-python-modules parameter key has the value pymysql==1.0.2, s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl.
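
If you define jobs programmatically, the same value can be assembled in code. The following is a minimal sketch, not an official AWS helper: the function names are hypothetical, and joining without spaces after the commas is an assumption. The resulting dictionary has the shape of the DefaultArguments field of a job definition (for instance, in boto3's Glue create_job call), but the sketch doesn't call AWS.

```python
def build_additional_modules_value(modules):
    """Join module specs into one comma-separated parameter value.

    Each spec is either a pip requirement (for example "pymysql==1.0.2")
    or an Amazon S3 URI that points to a wheel file.
    """
    cleaned = [spec.strip() for spec in modules if spec.strip()]
    for spec in cleaned:
        # A comma inside a single spec would be read as a list separator.
        if "," in spec:
            raise ValueError(f"spec must not contain a comma: {spec!r}")
    return ",".join(cleaned)


def build_default_arguments(modules):
    """Return job parameters keyed by --additional-python-modules."""
    return {"--additional-python-modules": build_additional_modules_value(modules)}


# The two modules from the example above.
modules = [
    "pymysql==1.0.2",
    "s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl",
]
default_arguments = build_default_arguments(modules)
print(default_arguments)
```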

Some modules have dependencies on other modules. If you install or update such a module, then you must also download the other modules that it depends on. This means that you must have internet access to install or update the module. If you don't have internet access, then see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.

For a list of Python modules that are included in each AWS Glue version by default, see Python modules already provided in AWS Glue.
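
To confirm which version of a module a job actually loads, you can log it from the job script. The following is a minimal sketch with a hypothetical helper name. It assumes that the job might run on a Python release earlier than 3.8, where importlib.metadata isn't in the standard library, so it falls back to pkg_resources from setuptools.

```python
def installed_version(distribution_name):
    """Return the installed version string, or None if it isn't installed."""
    try:
        from importlib import metadata  # Standard library in Python 3.8+
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(distribution_name)
        except metadata.PackageNotFoundError:
            return None

    # Fallback for older interpreters; assumes setuptools is available.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(distribution_name).version
    except pkg_resources.DistributionNotFound:
        return None


# Log the versions of the modules from the earlier example.
for name in ("pymysql", "nltk"):
    print(name, installed_version(name) or "not installed")
```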

Install C-based Python modules

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

AWS Glue also supports libraries and extensions written in C with the --additional-python-modules parameter. However, some Python modules, such as spacy and grpc, require root permissions to install. AWS Glue doesn't provide root access during package installation. To resolve this issue, precompile the binaries into a wheel compatible with AWS Glue and install that wheel.

To compile a library in a C-based language, the compiler must be compatible with the target operating system and processor architecture. If the library is compiled against a different operating system or processor architecture, then the wheel isn't installed in AWS Glue. Because AWS Glue is a managed service, cluster access isn't available to develop these dependencies.
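
A wheel's file name records the Python version, ABI, and platform that it was built for, so you can inspect it before you upload the wheel. The following is a minimal sketch with hypothetical helper names. It assumes that the file name follows the standard wheel naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag), and it doesn't check what a particular AWS Glue version accepts.

```python
from collections import namedtuple

WheelTags = namedtuple(
    "WheelTags", "distribution version python_tag abi_tag platform_tag"
)


def parse_wheel_filename(filename):
    """Split a wheel file name (or a path or URI ending in one) into its tags."""
    name = filename.rsplit("/", 1)[-1]
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel file: {name!r}")
    parts = name[: -len(".whl")].split("-")
    # Five fields, or six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {name!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return WheelTags(parts[0], parts[1], python_tag, abi_tag, platform_tag)


def is_pure_python(tags):
    """Pure-Python wheels aren't tied to an ABI or a platform."""
    return tags.abi_tag == "none" and tags.platform_tag == "any"


for wheel in (
    "s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl",
    "grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl",
):
    tags = parse_wheel_filename(wheel)
    kind = "pure Python" if is_pure_python(tags) else "compiled"
    print(tags.distribution, tags.python_tag, tags.platform_tag, kind)
```

A compiled wheel such as the grpcio example must match the interpreter and architecture of the environment where it's installed. A pure-Python wheel such as the NLTK example has no such constraint.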

To precompile a C-based Python module that requires root permissions, complete the following steps:

  1. Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance with enough volume space for your libraries.

  2. Install Docker on the EC2 instance, set up non-sudo access, and then start Docker. To do so, run the following commands:

    Install Docker:

    sudo yum install docker -y

    Set up non-sudo access:

    sudo usermod -a -G docker ec2-user

    Start Docker:

    sudo service docker start
  3. Create a Dockerfile for the module. For example, to install the grpcio module, create a file called dockerfile_grpcio and copy the following content into the file:

    # Base for AWS Glue
    FROM amazonlinux
    RUN yum update -y
    RUN yum install shadow-utils.x86_64 -y
    RUN yum install -y java-1.8.0-openjdk.x86_64
    RUN yum install -y python3
    RUN yum install -y cython doxygen numpy scipy gcc autoconf automake libtool zlib-devel openssl-devel maven wget protobuf-compiler cmake make gcc-c++
    # Additional components needed for grpcio
    WORKDIR /root
    RUN yum install python3-devel -y
    RUN yum install python-devel -y
    RUN pip3 install wheel
    # Install grpcio and related modules
    RUN pip3 install Cython
    RUN pip3 install cmake scikit-build
    RUN pip3 install grpcio
    # Create a directory for the wheel
    RUN mkdir wheel_dir
    # Create the wheel
    RUN pip3 wheel grpcio -w wheel_dir
  4. Run docker build to build an image from your Dockerfile:

    docker build -f dockerfile_grpcio .
  5. Restart the Docker daemon:

    sudo service docker restart

    When the docker build command from step 4 completes, the output includes a success message that contains your Docker image ID. For example, "Successfully built 1111222233334444". Note the Docker image ID to use in the next step.

  6. Extract the wheel (.whl) file from the Docker container. To do so, run the following commands:

    Get the Docker image ID:

    docker image ls

    Run the container, but replace 1111222233334444 with your Docker image ID:

    docker run -dit 1111222233334444

    Verify the location of the wheel file and retrieve its name. Replace 5555666677778888 with the container ID that the docker run command printed:

    docker exec -t -i 5555666677778888 ls /root/wheel_dir/

    Copy the wheel from the Docker container to Amazon EC2:

    docker cp 5555666677778888:/root/wheel_dir/doc-example-wheel .

    Note: Replace doc-example-wheel with the name of your generated wheel file.

  7. Upload the wheel to Amazon S3. The first command shows the general form, and the second command shows the grpcio example:

    aws s3 cp doc-example-wheel s3://path/to/wheel/
    aws s3 cp grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl s3://aws-glue-add-modules/grpcio/

    Note: Replace grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl with the name of your Python package file.

  8. Open the AWS Glue console.

  9. For the AWS Glue ETL job, under Job parameters, enter the following:
    For Key, enter --additional-python-modules.
    For Value, enter s3://aws-glue-add-modules/grpcio/grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl.
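
The pip3 wheel command also builds wheels for the dependencies of the requested module, so wheel_dir can contain more than one file. If you copy and upload all of them, then the parameter value lists every wheel. The following is a minimal sketch under that assumption: the helper name is hypothetical, the six wheel is a placeholder file name used only for illustration, and the code only builds strings without contacting Amazon S3.

```python
import tempfile
from pathlib import Path


def wheel_s3_uris(local_dir, s3_prefix):
    """Map each .whl file in local_dir to the S3 URI it has after upload."""
    prefix = s3_prefix.rstrip("/") + "/"
    return [prefix + path.name for path in sorted(Path(local_dir).glob("*.whl"))]


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl",
        "six-1.15.0-py2.py3-none-any.whl",  # hypothetical dependency wheel
        "notes.txt",  # ignored because it isn't a wheel
    ):
        (Path(tmp) / name).touch()
    uris = wheel_s3_uris(tmp, "s3://aws-glue-add-modules/grpcio")

# Value for the --additional-python-modules key.
parameter_value = ",".join(uris)
print(parameter_value)
```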

Related information

Using Python libraries with AWS Glue

AWS OFFICIAL
Updated 4 months ago