Glue Installation of Python modules timed out after 10 minutes


Hello, I have an AWS Glue 5.0 job where I specify --additional-python-modules s3://my-dev/other-dependencies/MyPackage-0.1.1-py3-none-any.whl in my job options. The job itself is just a print("hello") script. When I save and run it, it runs for 10 minutes 7 seconds in AWS, then fails with this error:

LAUNCH ERROR | Glue bootstrap failed. Please refer logs for details. caused by LAUNCH ERROR | Installation of Python modules timed out after 10 minutesPlease refer logs for details.

My package's setup.py uses these libraries:

from setuptools import setup, find_packages

setup(
    name="...",
    version="0.1.1",
    packages=find_packages(),
    install_requires=[
        'dask[array]',  
        'zarr',
        'scipy',
        'scikit-image',
        'bioio',
        'bioio-tifffile',
        'tifffile',
        'opencv-python',
        'torch',
        'pyyaml',
        'xmltodict'
    ],
    entry_points={...},
)

Is there any way to tell AWS, with a flag on an AWS Glue 5.0 job, that I need more than a 10-minute install timeout?

asked 2 months ago, 67 views
2 Answers

In AWS Glue 5.0, there isn't a direct flag to extend the 10-minute timeout for Python module installation. However, there are a few approaches you can try to address this issue:

  1. Use a requirements.txt file: Instead of specifying the modules directly in the job parameters, you can create a requirements.txt file with your dependencies and upload it to S3. Then, use the following job parameters:

    --python-modules-installer-option -r --additional-python-modules s3://path_to_your_requirements.txt

This method allows AWS Glue to manage dependencies more efficiently.

  2. Pre-install heavy dependencies: For large packages like torch, scipy, or scikit-image, you might want to include them in the --additional-python-modules parameter separately, rather than as part of your custom wheel file. This can help distribute the installation load.

  3. Optimize your package: Try to minimize the dependencies in your custom package. If possible, include only the necessary modules and consider using lighter alternatives for some libraries.

  4. Use a custom runtime: If the above methods don't work, you might need to create a custom runtime for your Glue job that includes these dependencies pre-installed.

  5. Split dependencies: If possible, try splitting your dependencies across multiple smaller packages. This might allow each to install within the time limit.
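
As a rough illustration of the requirements.txt approach and the separate-modules approach from the list above, here is a minimal Python sketch that builds the corresponding job parameters as key/value pairs. The helper names, the bucket name, and the example module list are hypothetical and only for illustration; the two parameter keys are the ones quoted in this answer, and the comma-separated value format for --additional-python-modules is assumed from the AWS Glue documentation.

```python
# Minimal sketch: build Glue job parameters for two of the approaches above.
# Helper names and S3 URIs are placeholders (assumptions), not values from the post.


def requirements_file_arguments(requirements_s3_uri: str) -> dict[str, str]:
    """Point Glue at a requirements.txt stored in S3 (the '-r' installer option)."""
    return {
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": requirements_s3_uri,
    }


def module_list_arguments(modules: list[str]) -> dict[str, str]:
    """List packages and wheels individually as one comma-separated value."""
    cleaned = [m.strip() for m in modules if m.strip()]
    return {"--additional-python-modules": ",".join(cleaned)}


if __name__ == "__main__":
    print(requirements_file_arguments("s3://my-bucket/requirements.txt"))
    print(
        module_list_arguments(
            ["torch", "scipy", "s3://my-bucket/MyPackage-0.1.1-py3-none-any.whl"]
        )
    )
```

Each key/value pair in the returned dictionary would be entered as one row under the job's parameters in the Glue console, or passed as the DefaultArguments of the job when it is created or updated through an SDK such as boto3. Neither variant raises the 10-minute limit; they only change what pip is asked to install.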

Remember, the 10-minute timeout is a system limitation and there's no direct way to extend it. If these approaches don't resolve the issue, you may need to contact AWS support for further assistance or consider restructuring your job to work within these constraints.
Sources
Introducing AWS Glue 5.0 for Apache Spark | AWS Big Data Blog
Using Python libraries with AWS Glue - AWS Glue

answered 2 months ago
  • How would I add this flag pointing to my req.txt in the AWS Glue web UI? Would I need to import the .whl and req.txt as two separate commands?


It looks like the packages you are trying to install have a lot of transitive dependencies, which have to be downloaded from the internet. If your job doesn't have a Glue connection, you can attach a Network type Glue connection and make sure a NAT gateway is attached to the connection's subnet, so that the job can reach the internet.
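
One way to check this, sketched under assumptions: run a separate throwaway Glue job with no --additional-python-modules (so it gets past bootstrap), attached to the same connection, and test whether the default pip hosts are reachable. The can_reach helper is hypothetical; pypi.org and files.pythonhosted.org are pip's default index and file hosts, and a job configured with a private index would need different hostnames.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Default pip hosts; replace these if the job uses a private index.
    for host in ("pypi.org", "files.pythonhosted.org"):
        print(host, "reachable" if can_reach(host) else "NOT reachable")
```

If either host is reported as not reachable, the installation is likely stalling on downloads rather than on the size of the packages, and the subnet routing is the first thing to look at.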

answered 2 months ago
