call a stored procedure from within glue job

0

Hi team,

I want to call my stored procedure at the end of my glue job (inside the glue job itself),

can I connect to MySQL RDS DB and call this stored procedure from my pyspark glue job without using py4j?

I don't find examples on the net (this one speak about using py4j: https://stackoverflow.com/questions/64775753/how-to-run-arbitrary-ddl-sql-statements-or-stored-procedures-using-aws-glue)

any code example would be helpful. Thanks a lot

  • if your glue ETL jobs needs multiple workers and the stored procedure it is not immediate, you might be wasting resources. It might be better to run the Stored procedure from a Glue Python Shell job that gets executed as soon as your Glue Spark job is completed in a workflow.

1 Answer
0
Accepted Answer

I used pymysql and worked for me.

this is an example of the code I used :

import sys
import pymysql

........

rds_host  = "database-id.rds.amazonaws.com"
name = "userName"
password = "secretPassword"
db_name = "myDb"
conn = pymysql.connect(host=rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)

with conn.cursor() as cur:
   query="CALL mystoredProcedure()"
   cur.execute(query)
   conn.commit()
   cur.close()
Jess
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
  • @Jess , Is there any security group or VPC setting required for this ? I am doing the same thing but the connection is failing. My DB is under a VPC

  • Hi, as pointed in the above comment, I am also facing RDS instance connectivity issues when using the above solution. My RDS instance is in a VPC and I have configured inbound rules for the security group but still cannot connect. Below is the exact error message:

    OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (gluetest.cgj1ict7xclx.ap-southeast-1.rds.amazonaws.com)\nNet-Lib error during Connection timed out (110)\nDB-Lib error message 20009, severity 9:\nUnable to connect: Adaptive Server is unavailable or does not exist (gluetest.cgj1ict7xclx.ap-southeast-1.rds.amazonaws.com)\nNet-Lib error during Connection timed out (110)\n')

  • you need to put jdbc:mysql://

    jdbc:mysql://you_db_cluster_endpoint:3306/yourDb_name?verifyServerCertificate=false&useSSL=true

    maybe try this :

    import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import SparkContext from awsglue.context import GlueContext from awsglue.job import Job from botocore.exceptions import ClientError import boto3 from awsglue.dynamicframe import DynamicFrame import pyspark.sql.functions as f

    glue_connection_name = args['GLUE_CONNECTION_NAME'] // your glue connection database_name = args['DB_NAME']

    print('Getting details for connection ' + glue_connection_name) source_jdbc_conf = glueContext.extract_jdbc_conf(glue_connection_name)

    from py4j.java_gateway import java_import java_import(sc._gateway.jvm,"java.sql.Connection") java_import(sc._gateway.jvm,"java.sql.DatabaseMetaData") java_import(sc._gateway.jvm,"java.sql.DriverManager") java_import(sc._gateway.jvm,"java.sql.SQLException")

    conn = sc._gateway.jvm.DriverManager.getConnection(source_jdbc_conf.get('url') + '/' + database_name, source_jdbc_conf.get('user'), source_jdbc_conf.get('password')) print('Connected to ' + conn.getMetaData().getDatabaseProductName() + ', ' + source_jdbc_conf.get('url'))

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions