Include additional Python module in Glue job


I am trying to include python-oracledb in my job. I have followed the instructions from here, saving various versions of the relevant .whl files from PyPI. I have set the Glue job parameter --additional-python-modules as the key and the value as the S3 URI.

When I run my job I still get ModuleNotFoundError: No module named 'oracledb'.

Please help.

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import boto3
    import oracledb

Job params
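For reference, the parameter pair looks roughly like this (the bucket, path, and wheel filename below are placeholders, not real values — the value must be the s3:// object URI of the uploaded wheel, not the console HTTPS URL):

```python
# Hypothetical values -- replace the bucket/key with the real wheel location.
job_parameters = {
    "--additional-python-modules": (
        "s3://my-bucket/wheels/oracledb-2.4.1-cp310-cp310-"
        "manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
    )
}
```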

  • Just found the error in the log: it is not supported. Any idea if it will ever be supported?

asked 18 days ago · 26 views
2 Answers

Hi,

I understand you wish to use python-oracledb in your Glue PySpark ETL job. I did some tests in my test environment and I can confirm this can be done by either of the following approaches:

  1. If your Glue job runs in a VPC subnet with public internet access (a NAT gateway is required, since Glue workers don't have a public IP address [1]), you can specify the job parameter like this:
Key:  --additional-python-modules
Value:  oracledb
  2. If your Glue job runs in a VPC without internet access, you must create a Python repository on Amazon S3 by following this documentation [2] and include oracledb in your "modules_to_install.txt" file. Then you should be able to install the package from your own Python repository on S3 by using the following parameters (make sure to replace MY-BUCKET with the real bucket name for your use case):
"--additional-python-modules" : "oracledb",
"--python-modules-installer-option" : "--no-index --find-links=http://MY-BUCKET.s3-website-us-east-1.amazonaws.com/wheelhouse --trusted-host MY-BUCKET.s3-website-us-east-1.amazonaws.com"

Ref:

[1] https://aws.amazon.com/premiumsupport/knowledge-center/nat-gateway-vpc-private-subnet/

[2] https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/
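Once the module installs successfully, the job script can use it like any normal import. A minimal thin-mode sketch (host, credentials, and service name below are placeholders, not values from this question):

```python
def fetch_sysdate(user: str, password: str, dsn: str):
    """Open a thin-mode connection (no Oracle Client libraries needed on
    the Glue workers) and run a trivial query."""
    import oracledb  # imported lazily; available once the module installs
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT sysdate FROM dual")
            return cur.fetchone()[0]

# Example call with a placeholder DSN:
# fetch_sysdate("scott", "tiger", "db-host:1521/ORCLPDB1")
```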

Ethan_H
answered 14 days ago

Hello,

Thank you for your question. My name is Yvonne, from RDS team.

From your question I understand that you experienced the error "ModuleNotFoundError: No module named 'oracledb'" and also noticed the error in the log, "it is not supported", while trying to include python-oracledb in your Glue job, so you want to know when it will be supported.

Unfortunately, I am not able to provide a timeline, as our development team sets its own schedule; however, we announce all new features when we release them in the blogs below [1] [2].

Please note that --additional-python-modules is applicable to Spark Glue jobs with Glue versions 2.0 and 3.0. You can include an external Python library as described in link [3].

For supported versions please refer to the below documentation:

[+] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

In case you require further assistance or have any queries, feel free to respond to the case and I will be happy to assist you.

References:

[1] https://aws.amazon.com/new/
[2] https://aws.amazon.com/blogs/aws/
[3] https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-limitations

answered 14 days ago
