How do I resolve Python module installation failures in AWS Glue Python Shell jobs when using wheel files from Amazon S3 in a private VPC?

6 minute read
Content level: Expert
I'm trying to install Python modules in my AWS Glue Python Shell job using wheel files stored in Amazon Simple Storage Service (Amazon S3). My job runs in a private virtual private cloud (VPC) with Amazon S3 access, but the installation keeps failing. When I use the --additional-python-modules parameter, I receive errors indicating that pip modules could not be installed. When I switch to --extra-py-files, the job still fails with timeout errors trying to reach PyPI or dependency repositories.

Short description

AWS Glue Python Shell jobs with --extra-py-files add wheel files to the Python path but don't automatically install them with pip. When Python attempts to import these modules, it discovers missing dependencies declared in the wheel metadata and tries to fetch them from PyPI. In private VPC environments without internet access, this causes installation failures. The solution is to clean the dependency metadata from wheel files to prevent pip from attempting to download additional packages from PyPI.

Resolution

Understanding the root cause

When you use --extra-py-files with wheel files in AWS Glue Python Shell jobs, the wheels are downloaded from Amazon S3 and added to the Python path. However, wheel files contain metadata that declares their dependencies. When Python attempts to use these modules, pip checks the metadata and tries to install any missing dependencies from PyPI. In a private VPC without internet access, this causes the job to fail. The key insight is that even when you provide all dependency wheel files, pip still attempts to validate and install dependencies based on the metadata inside each wheel file. This is why providing files in the correct order doesn't solve the problem.
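To confirm which dependencies a wheel declares (and therefore what pip would try to fetch from PyPI), you can read its METADATA without installing anything. The following is a minimal sketch using only the Python standard library; the wheel path in the example comment is a placeholder:

```python
import zipfile

def list_requires(wheel_path):
    """Return the Requires-Dist lines declared in a wheel's METADATA file."""
    found = []
    # A wheel is a ZIP archive; METADATA lives in the .dist-info directory.
    with zipfile.ZipFile(wheel_path) as zf:
        for name in zf.namelist():
            if name.endswith(".dist-info/METADATA"):
                for line in zf.read(name).decode("utf-8").splitlines():
                    if line.startswith("Requires-Dist:"):
                        found.append(line)
    return found

# Example (placeholder path):
# print(list_requires("cffi-2.0.0-cp39-cp39-manylinux2014_x86_64.whl"))
```

Any line this prints is a package pip will try to resolve, which is exactly what fails in a private VPC.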

Note: Order the wheel files in dependency order, because the backend logic installs the modules in the order they are passed to the --extra-py-files argument. For example, in this case:

  • pycparser
  • cffi
  • typing_extensions
  • cryptography
  • oracledb (or your main package)

Solution: Clean wheel metadata

The solution involves removing dependency declarations from the wheel metadata so pip doesn't attempt to fetch additional packages. Follow these steps:

Step 1: Download and extract the wheel file

Wheel files are ZIP archives. Extract them to access the metadata:

# Create a working directory
mkdir wheel_cleanup
cd wheel_cleanup
# Copy your wheel file
cp /path/to/your/package.whl .
# Extract the wheel (it's a ZIP file)
unzip package.whl

Step 2: Locate and modify the metadata

Find the METADATA file in the extracted contents:

# The metadata is typically in a .dist-info directory
# Example: cffi-2.0.0.dist-info/METADATA
ls -la *.dist-info/

Open the METADATA file and locate the Requires-Dist entries. These declare the dependencies:

Requires-Dist: pycparser
Requires-Dist: typing-extensions>=4.13.2

Step 3: Remove dependency declarations

Edit the METADATA file and delete all Requires-Dist lines. (The METADATA format has no comment syntax, so the lines must be removed.) Keep all other metadata intact:

# Use sed to remove Requires-Dist lines
sed -i '/^Requires-Dist:/d' cffi-2.0.0.dist-info/METADATA

Alternatively, manually edit the file and delete these lines.

Step 4: Repackage the wheel

Create a new wheel file with the cleaned metadata:

# Repackage the wheel
zip -r cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl *
# Verify the new wheel
unzip -l cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl | grep METADATA

Step 5: Repeat for all dependency wheels

Perform steps 1-4 for each wheel file in your dependency chain:

  • pycparser
  • cffi
  • typing_extensions
  • cryptography
  • oracledb (or your main package)

Step 6: Upload cleaned wheels to Amazon S3

aws s3 cp cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl s3://your-bucket/python-libs/
aws s3 cp pycparser-2.23-py3-none-any-cleaned.whl s3://your-bucket/python-libs/
# Upload all cleaned wheels

Step 7: Update your AWS Glue job configuration

Use the cleaned wheel files in your job parameters. List dependencies in order (dependencies first, then the main package):

{
  "--extra-py-files": "s3://your-bucket/python-libs/pycparser-2.23-py3-none-any-cleaned.whl,s3://your-bucket/python-libs/cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl,s3://your-bucket/python-libs/typing_extensions-4.15.0-py3-none-any-cleaned.whl,s3://your-bucket/python-libs/cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64-cleaned.whl,s3://your-bucket/python-libs/oracledb-3.4.1-cp39-cp39-manylinux2014_x86_64-cleaned.whl"
}
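If you manage job configuration programmatically, the same change can be applied with the AWS SDK. The sketch below builds the merged DefaultArguments as a plain dict so it needs no AWS credentials; the boto3 get_job/update_job calls are shown in comments, and the job name and bucket paths are placeholders you must replace:

```python
# Placeholder S3 URIs; substitute your bucket and actual wheel filenames.
CLEANED_WHEELS = [
    "s3://your-bucket/python-libs/pycparser-2.23-py3-none-any-cleaned.whl",
    "s3://your-bucket/python-libs/cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl",
    "s3://your-bucket/python-libs/typing_extensions-4.15.0-py3-none-any-cleaned.whl",
    "s3://your-bucket/python-libs/cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64-cleaned.whl",
    "s3://your-bucket/python-libs/oracledb-3.4.1-cp39-cp39-manylinux2014_x86_64-cleaned.whl",
]

def with_extra_py_files(default_args, wheel_uris):
    """Return a copy of a job's DefaultArguments with --extra-py-files set."""
    args = dict(default_args)
    args["--extra-py-files"] = ",".join(wheel_uris)
    return args

# With credentials configured, apply the change like this (hypothetical job name):
#   import boto3
#   glue = boto3.client("glue")
#   job = glue.get_job(JobName="my-python-shell-job")["Job"]
#   glue.update_job(
#       JobName="my-python-shell-job",
#       JobUpdate={
#           "Role": job["Role"],
#           "Command": job["Command"],
#           "DefaultArguments": with_extra_py_files(
#               job.get("DefaultArguments", {}), CLEANED_WHEELS),
#       },
#   )
```

Merging into the existing DefaultArguments (rather than replacing them) preserves other settings such as --job-language.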

Alternative approach: Automated script

To streamline the process, use this Python script to clean multiple wheels:

import zipfile
import os
import shutil
from pathlib import Path

def clean_wheel_metadata(wheel_path, output_dir):
    """Remove Requires-Dist from wheel metadata"""
    wheel_name = Path(wheel_path).stem
    temp_dir = f"temp_{wheel_name}"
    
    # Extract wheel
    with zipfile.ZipFile(wheel_path, 'r') as zip_ref:
        zip_ref.extractall(temp_dir)
    
    # Find and clean METADATA file
    for root, dirs, files in os.walk(temp_dir):
        if root.endswith('.dist-info'):
            metadata_path = os.path.join(root, 'METADATA')
            if os.path.exists(metadata_path):
                with open(metadata_path, 'r') as f:
                    lines = f.readlines()
                
                # Remove Requires-Dist lines
                cleaned_lines = [line for line in lines if not line.startswith('Requires-Dist:')]
                
                with open(metadata_path, 'w') as f:
                    f.writelines(cleaned_lines)
                
                print(f"Cleaned metadata in {metadata_path}")
    
    # Repackage wheel
    output_wheel = os.path.join(output_dir, f"{wheel_name}-cleaned.whl")
    with zipfile.ZipFile(output_wheel, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(temp_dir):
            for file in files:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, temp_dir)
                zipf.write(file_path, arcname)
    
    # Cleanup
    shutil.rmtree(temp_dir)
    print(f"Created cleaned wheel: {output_wheel}")
    return output_wheel

# Usage
wheels = [
    'pycparser-2.23-py3-none-any.whl',
    'cffi-2.0.0-cp39-cp39-manylinux2014_x86_64.whl',
    'typing_extensions-4.15.0-py3-none-any.whl',
    'cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64.whl',
    'oracledb-3.4.1-cp39-cp39-manylinux2014_x86_64.whl'
]
os.makedirs('cleaned_wheels', exist_ok=True)
for wheel in wheels:
    clean_wheel_metadata(wheel, 'cleaned_wheels')

Note: Replace the wheel filenames with your actual wheel file names.

Verification

After updating your AWS Glue job with cleaned wheels, run the job and verify:

  1. The job completes successfully without timeout errors
  2. No attempts to reach PyPI in the logs
  3. Your Python code can import the modules correctly

Test the imports in your Glue script:
import oracledb
import cryptography
print(f"oracledb version: {oracledb.__version__}")
print(f"cryptography version: {cryptography.__version__}")
print("All modules imported successfully!")

Important considerations

Python version compatibility: Ensure your wheel files match the Python version used by AWS Glue Python Shell jobs. As of this writing, AWS Glue Python Shell supports Python 3.9. Wheel filenames contain the Python version (e.g., cp39 for Python 3.9).

Architecture compatibility: Use wheels built for the correct architecture. AWS Glue runs on x86_64 Linux, so use manylinux wheels (e.g., manylinux2014_x86_64).

Dependency order: While cleaning metadata eliminates the need for strict ordering, it's still good practice to list dependencies before the packages that use them in the --extra-py-files parameter.

Testing locally: Before deploying to AWS Glue, test your cleaned wheels in a local Python 3.9 environment to ensure they work correctly.

Why --additional-python-modules doesn't work

The --additional-python-modules parameter instructs AWS Glue to use pip to install packages. When you provide Amazon S3 paths to wheel files, pip still validates dependencies and attempts to download missing ones from PyPI. In a private VPC without internet access or PyPI connectivity, this fails. The --extra-py-files approach with cleaned metadata bypasses pip's dependency resolution entirely.

Troubleshooting

Issue: Job still fails with "module not found" errors

Solution: Verify that all transitive dependencies are included. Use pip show package-name on a machine with internet access to see all dependencies, then ensure you've included and cleaned all of them.
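On a machine where the packages are installed, the dependency chain can also be enumerated programmatically. A rough sketch using importlib.metadata follows; note it only sees packages installed in the current environment, and the name parsing is deliberately simplified (optional "extra" dependencies are skipped):

```python
import re
from importlib.metadata import PackageNotFoundError, requires

def transitive_requirements(package, seen=None):
    """Recursively collect the declared dependencies of an installed package."""
    if seen is None:
        seen = set()
    try:
        declared = requires(package) or []
    except PackageNotFoundError:
        return seen  # not installed locally; cannot inspect further
    for req in declared:
        # Skip optional dependencies guarded by an "extra" environment marker.
        if ";" in req and "extra ==" in req.split(";", 1)[1]:
            continue
        m = re.match(r"[A-Za-z0-9_.-]+", req)
        if m and m.group(0).lower() not in seen:
            seen.add(m.group(0).lower())
            transitive_requirements(m.group(0), seen)
    return seen

# Example: print(sorted(transitive_requirements("oracledb")))
```

Every name this returns should correspond to a cleaned wheel in your --extra-py-files list.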

Issue: Import errors about missing symbols or incompatible versions

Solution: Ensure all wheels are built for the same Python version and architecture. Check that you haven't accidentally mixed Python 3.8 and 3.9 wheels.

Issue: Wheel repackaging fails

Solution: Ensure you're in the correct directory when running the zip command. The wheel structure must be preserved exactly as it was in the original file.
