How do I resolve Python module installation failures in AWS Glue Python Shell jobs when using wheel files from Amazon S3 in a private VPC?
I'm trying to install Python modules in my AWS Glue Python Shell job using wheel files stored in Amazon Simple Storage Service (Amazon S3). My job runs in a private Virtual Private Cloud (VPC) with Amazon S3 access, but the installation keeps failing. When I use the --additional-python-modules parameter, I receive errors indicating that pip modules could not be installed. When I switch to --extra-py-files, the job still fails with timeout errors trying to reach PyPI or dependency repositories.
Short description
AWS Glue Python Shell jobs with --extra-py-files add wheel files to the Python path but don't automatically install them with pip. When Python attempts to import these modules, it discovers missing dependencies declared in the wheel metadata and tries to fetch them from PyPI. In private VPC environments without internet access, this causes installation failures. The solution is to clean the dependency metadata from wheel files to prevent pip from attempting to download additional packages from PyPI.
Resolution
Understanding the root cause
When you use --extra-py-files with wheel files in AWS Glue Python Shell jobs, the wheels are downloaded from Amazon S3 and added to the Python path. However, wheel files contain metadata that declares their dependencies. When Python attempts to use these modules, pip checks the metadata and tries to install any missing dependencies from PyPI. In a private VPC without internet access, this causes the job to fail.
The key insight is that even when you provide all dependency wheel files, pip still attempts to validate and install dependencies based on the metadata inside each wheel file. This is why providing files in the correct order doesn't solve the problem.
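To see the metadata pip reads, you can inspect a wheel directly. The following sketch builds a toy wheel (the package name "demo" and its metadata are made up for illustration) and reads back its Requires-Dist declarations:

```python
import zipfile

# Wheels are plain ZIP archives. Build a toy one whose metadata
# declares a dependency, exactly like a real wheel would.
wheel_path = "demo-1.0-py3-none-any.whl"
metadata = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Version: 1.0\n"
    "Requires-Dist: pycparser\n"
)
with zipfile.ZipFile(wheel_path, "w") as whl:
    whl.writestr("demo-1.0.dist-info/METADATA", metadata)

# These lines are what pip reads during dependency resolution;
# in a private VPC it would try (and fail) to fetch them from PyPI.
with zipfile.ZipFile(wheel_path) as whl:
    text = whl.read("demo-1.0.dist-info/METADATA").decode()
requires = [l for l in text.splitlines() if l.startswith("Requires-Dist:")]
print(requires)  # ['Requires-Dist: pycparser']
```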
Note: List the wheel files in dependency order, because the backend logic processes modules in the order passed to the --extra-py-files argument. For example, for this case:
- pycparser
- cffi
- typing_extensions
- cryptography
- oracledb (or your main package)
Solution: Clean wheel metadata
The solution involves removing dependency declarations from the wheel metadata so pip doesn't attempt to fetch additional packages. Follow these steps:
Step 1: Download and extract the wheel file

Wheel files are ZIP archives. Extract them to access the metadata:

```
# Create a working directory
mkdir wheel_cleanup
cd wheel_cleanup

# Copy your wheel file
cp /path/to/your/package.whl .

# Extract the wheel (it's a ZIP file)
unzip package.whl
```
Step 2: Locate and modify the metadata

Find the METADATA file in the extracted contents:

```
# The metadata is typically in a .dist-info directory
# Example: cffi-2.0.0.dist-info/METADATA
ls -la *.dist-info/
```
Open the METADATA file and locate the Requires-Dist entries. These declare the dependencies:

```
Requires-Dist: pycparser
Requires-Dist: typing-extensions>=4.13.2
```
Step 3: Remove dependency declarations

Edit the METADATA file and delete all Requires-Dist lines. Keep all other metadata intact:

```
# Use sed to remove Requires-Dist lines
# (on macOS/BSD sed, use: sed -i '' '/^Requires-Dist:/d' ...)
sed -i '/^Requires-Dist:/d' cffi-2.0.0.dist-info/METADATA
```
Alternatively, manually edit the file and delete these lines.
Step 4: Repackage the wheel

Create a new wheel file with the cleaned metadata:

```
# Repackage the wheel
zip -r cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl *

# Verify the new wheel
unzip -l cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl | grep METADATA
```
Step 5: Repeat for all dependency wheels

Perform steps 1-4 for each wheel file in your dependency chain:
- pycparser
- cffi
- typing_extensions
- cryptography
- oracledb (or your main package)
Step 6: Upload cleaned wheels to Amazon S3

```
aws s3 cp cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl s3://your-bucket/python-libs/
aws s3 cp pycparser-2.23-py3-none-any-cleaned.whl s3://your-bucket/python-libs/
# Upload all cleaned wheels
```
Step 7: Update your AWS Glue job configuration

Use the cleaned wheel files in your job parameters. List them in dependency order (dependencies first, then the main package):

```
{
  "--extra-py-files": "s3://your-bucket/python-libs/pycparser-2.23-py3-none-any-cleaned.whl,s3://your-bucket/python-libs/cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl,s3://your-bucket/python-libs/typing_extensions-4.15.0-py3-none-any-cleaned.whl,s3://your-bucket/python-libs/cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64-cleaned.whl,s3://your-bucket/python-libs/oracledb-3.4.1-cp39-cp39-manylinux2014_x86_64-cleaned.whl"
}
```
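The long comma-separated value is easy to mistype by hand. As a sketch, it can be assembled programmatically from the ordered wheel list (the bucket name, prefix, and wheel filenames below are the placeholders used in this article):

```python
# Assemble the --extra-py-files value from an ordered list of wheels.
# "your-bucket" and "python-libs" are placeholders, not real resources.
bucket = "your-bucket"
prefix = "python-libs"
wheels = [  # dependencies first, main package last
    "pycparser-2.23-py3-none-any-cleaned.whl",
    "cffi-2.0.0-cp39-cp39-manylinux2014_x86_64-cleaned.whl",
    "typing_extensions-4.15.0-py3-none-any-cleaned.whl",
    "cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64-cleaned.whl",
    "oracledb-3.4.1-cp39-cp39-manylinux2014_x86_64-cleaned.whl",
]
extra_py_files = ",".join(f"s3://{bucket}/{prefix}/{w}" for w in wheels)
job_args = {"--extra-py-files": extra_py_files}
print(extra_py_files.split(",")[0])
# s3://your-bucket/python-libs/pycparser-2.23-py3-none-any-cleaned.whl
```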
Alternative approach: Automated script

To streamline the process, use this Python script to clean multiple wheels:

```python
import zipfile
import os
import shutil
from pathlib import Path

def clean_wheel_metadata(wheel_path, output_dir):
    """Remove Requires-Dist from wheel metadata"""
    wheel_name = Path(wheel_path).stem
    temp_dir = f"temp_{wheel_name}"

    # Extract wheel
    with zipfile.ZipFile(wheel_path, 'r') as zip_ref:
        zip_ref.extractall(temp_dir)

    # Find and clean METADATA file
    for root, dirs, files in os.walk(temp_dir):
        if root.endswith('.dist-info'):
            metadata_path = os.path.join(root, 'METADATA')
            if os.path.exists(metadata_path):
                with open(metadata_path, 'r') as f:
                    lines = f.readlines()
                # Remove Requires-Dist lines
                cleaned_lines = [line for line in lines
                                 if not line.startswith('Requires-Dist:')]
                with open(metadata_path, 'w') as f:
                    f.writelines(cleaned_lines)
                print(f"Cleaned metadata in {metadata_path}")

    # Repackage wheel
    output_wheel = os.path.join(output_dir, f"{wheel_name}-cleaned.whl")
    with zipfile.ZipFile(output_wheel, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(temp_dir):
            for file in files:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, temp_dir)
                zipf.write(file_path, arcname)

    # Cleanup
    shutil.rmtree(temp_dir)
    print(f"Created cleaned wheel: {output_wheel}")
    return output_wheel

# Usage
wheels = [
    'pycparser-2.23-py3-none-any.whl',
    'cffi-2.0.0-cp39-cp39-manylinux2014_x86_64.whl',
    'typing_extensions-4.15.0-py3-none-any.whl',
    'cryptography-46.0.3-cp38-abi3-manylinux_2_34_x86_64.whl',
    'oracledb-3.4.1-cp39-cp39-manylinux2014_x86_64.whl'
]
os.makedirs('cleaned_wheels', exist_ok=True)
for wheel in wheels:
    clean_wheel_metadata(wheel, 'cleaned_wheels')
```
Note: Replace the wheel filenames with your actual wheel file names.
Verification
After updating your AWS Glue job with cleaned wheels, run the job and verify:
- The job completes successfully without timeout errors
- No attempts to reach PyPI in the logs
- Your Python code can import the modules correctly

Test the imports in your Glue script:

```python
import oracledb
import cryptography

print(f"oracledb version: {oracledb.__version__}")
print(f"cryptography version: {cryptography.__version__}")
print("All modules imported successfully!")
```
Important considerations
Python version compatibility: Ensure your wheel files match the Python version used by AWS Glue Python Shell jobs. As of this writing, AWS Glue Python Shell supports Python 3.9. Wheel filenames contain the Python version (e.g., cp39 for Python 3.9).
Architecture compatibility: Use wheels built for the correct architecture. AWS Glue runs on x86_64 Linux, so use manylinux wheels (e.g., manylinux2014_x86_64).
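Wheel filenames encode these compatibility tags in a fixed pattern, {name}-{version}-{python}-{abi}-{platform}.whl, so a quick check before uploading can catch a mismatched wheel. A minimal sketch (the filenames are examples; note that abi3 wheels built for an older Python, such as cp38-abi3, also work on newer versions):

```python
# Parse the compatibility tags out of a wheel filename.
# Format: {name}-{version}-{python}-{abi}-{platform}.whl
def wheel_tags(filename):
    parts = filename[:-len(".whl")].split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag

py, abi, plat = wheel_tags("cffi-2.0.0-cp39-cp39-manylinux2014_x86_64.whl")
print(py, plat)  # cp39 manylinux2014_x86_64

# Pure-Python wheels (py3-none-any) run on any architecture.
assert wheel_tags("pycparser-2.23-py3-none-any.whl") == ("py3", "none", "any")
```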
Dependency order: While cleaning metadata eliminates the need for strict ordering, it's still good practice to list dependencies before the packages that use them in the --extra-py-files parameter.
Testing locally: Before deploying to AWS Glue, test your cleaned wheels in a local Python 3.9 environment to ensure they work correctly.
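When testing locally, you can mimic how --extra-py-files exposes wheels: the files are placed on sys.path, and Python's zipimport loads pure-Python modules directly from the ZIP archive. This toy example (the package name "demozip" is made up) demonstrates the mechanism:

```python
import sys
import zipfile

# Build a minimal pure-Python wheel containing one package.
wheel_path = "demozip-1.0-py3-none-any.whl"
with zipfile.ZipFile(wheel_path, "w") as whl:
    whl.writestr("demozip/__init__.py", "VERSION = '1.0'\n")

# Adding the wheel to sys.path is what --extra-py-files does;
# zipimport then resolves the import from inside the archive.
sys.path.insert(0, wheel_path)
import demozip
print(demozip.VERSION)  # 1.0
```

Note that this shortcut only works for pure-Python code; wheels with compiled extensions (such as cffi or cryptography) must be physically present on disk, which is why matching manylinux wheels matter.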
Why --additional-python-modules doesn't work
The --additional-python-modules parameter instructs AWS Glue to use pip to install packages. When you provide Amazon S3 paths to wheel files, pip still validates dependencies and attempts to download missing ones from PyPI. In a private VPC without internet access or PyPI connectivity, this fails. The --extra-py-files approach with cleaned metadata bypasses pip's dependency resolution entirely.
Troubleshooting
Issue: Job still fails with "module not found" errors
Solution: Verify that all transitive dependencies are included. Use pip show package-name on a machine with internet access to see all dependencies, then ensure you've included and cleaned all of them.
Issue: Import errors about missing symbols or incompatible versions
Solution: Ensure all wheels are built for the same Python version and architecture. Check that you haven't accidentally mixed Python 3.8 and 3.9 wheels.
Issue: Wheel repackaging fails
Solution: Ensure you're in the correct directory when running the zip command. The wheel structure must be preserved exactly as it was in the original file.