AWS Glue notebook SQL command example in PySpark


I need to execute two commands against an Aurora MySQL database that already has a Glue connection in place. The first command is TRUNCATE TABLE and the second is LOAD DATA FROM S3 into a table. I know I could easily do this with a Python Lambda function, but its 15-minute timeout isn't enough, since the data I need to load is an 11 GB text file.

I couldn't find any examples of how to do this (run a SQL command using an existing connection). I've built a visual ETL job that works fine; it has a couple of transformations that wouldn't be needed if I could run the LOAD DATA FROM S3 command. I also suspect it would take less time (it currently takes 7 hours).

Any help appreciated.

asked 8 months ago · 51 views
1 Answer
Accepted Answer

For this scenario, I can suggest a couple of approaches using AWS Glue, focusing on executing the SQL commands directly:

  1. Using an AWS Glue PySpark job with pymysql:
import re

from awsglue.context import GlueContext
from pyspark.context import SparkContext
import pymysql  # not in Glue's default environment; see the note after this example

def execute_sql_commands():
    # Initialize Glue context
    glueContext = GlueContext(SparkContext.getOrCreate())

    # Get the JDBC settings stored in the Glue connection
    # (typically includes 'url', 'user' and 'password')
    connection = glueContext.extract_jdbc_conf('your-connection-name')

    # Parse host and port out of a JDBC URL such as jdbc:mysql://host:3306/db
    match = re.match(r'jdbc:mysql://([^:/]+)(?::(\d+))?', connection['url'])
    host = match.group(1)
    port = int(match.group(2) or 3306)

    # Connect to Aurora MySQL
    conn = pymysql.connect(
        host=host,
        port=port,
        user=connection['user'],
        password=connection['password'],
        db='your_database'  # the database name is not always returned by extract_jdbc_conf
    )

    try:
        with conn.cursor() as cursor:
            # Empty the target table
            cursor.execute("TRUNCATE TABLE your_table")

            # Bulk load the file from S3 (Aurora MySQL extension)
            load_command = """
                LOAD DATA FROM S3 's3://your-bucket/your-file.txt'
                INTO TABLE your_table
                FIELDS TERMINATED BY ','
                LINES TERMINATED BY '\\n'
                (column1, column2, ...)
            """
            cursor.execute(load_command)

        conn.commit()
    finally:
        conn.close()

# Call the function
execute_sql_commands()
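
A note on this first approach: pymysql is not included in Glue's default Python environment, so pass it in through the --additional-python-modules job parameter (for example, set it to pymysql). Also, the exact keys returned by extract_jdbc_conf can differ between Glue versions, so it is worth printing the dictionary once and adjusting the parsing above accordingly.
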
  2. Alternative approach using AWS Glue's built-in JDBC writer:
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.types import StructType

def execute_sql_commands():
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session

    # An empty frame: the write itself does nothing, it only exists so that
    # the preactions/postactions SQL is sent to the database
    empty_frame = DynamicFrame.fromDF(
        spark.createDataFrame([], StructType([])), glueContext, "empty"
    )

    # Execute the SQL commands through the catalogued Glue connection
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=empty_frame,
        catalog_connection="your-connection-name",
        connection_options={
            "dbtable": "your_table",
            "database": "your_database",
            "preactions": "TRUNCATE TABLE your_table;",
            "postactions": """
                LOAD DATA FROM S3 's3://your-bucket/your-file.txt'
                INTO TABLE your_table
                FIELDS TERMINATED BY ','
                LINES TERMINATED BY '\\n'
                (column1, column2, ...)
            """
        }
    )
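
A caveat on this second approach: preactions and postactions are documented primarily for Amazon Redshift targets in Glue, so verify that they actually execute against a MySQL JDBC connection before relying on this path. If they don't, the first approach (a direct pymysql connection) gives you full control over the SQL being run.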

Key considerations:

1. Set an appropriate timeout: AWS Glue jobs default to a 48-hour timeout (and it's configurable), so you won't face Lambda's 15-minute limitation.

2. IAM roles: Ensure your Glue role has:
   - Access to the Aurora cluster
   - Access to the S3 bucket
   - Permissions to execute LOAD DATA FROM S3

3. Network configuration: Make sure your Glue job runs in a VPC that can access your Aurora cluster.

4. Monitor progress: You can monitor the job progress in CloudWatch logs.

For better performance, you might want to consider:

- Splitting the 11 GB file into smaller pieces and loading them with LOAD DATA FROM S3 PREFIX or a manifest (the CONCURRENT modifier only benefits MyISAM tables, so it won't help on Aurora's InnoDB)
- Adjusting Aurora's innodb_buffer_pool_size
- Setting the aurora_load_from_s3_role (or aws_default_s3_role) cluster parameter and associating the matching IAM role with the cluster, as sketched below (local_infile only affects LOAD DATA LOCAL INFILE, not LOAD DATA FROM S3)
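
As a rough illustration of that last point, here is a minimal boto3 sketch of the one-time cluster setup that LOAD DATA FROM S3 requires. The cluster identifier, parameter group name, and role ARN are placeholders for your own resources, and depending on your Aurora MySQL version the parameter is aurora_load_from_s3_role or aws_default_s3_role:

import boto3

rds = boto3.client("rds")

# Attach the IAM role that grants S3 read access to the Aurora cluster
# (cluster identifier and role ARN below are placeholders)
rds.add_role_to_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",
    RoleArn="arn:aws:iam::123456789012:role/aurora-s3-load-role",
)

# Point the cluster at that role via its DB cluster parameter group
# (use aws_default_s3_role instead if your engine version expects it)
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-cluster-params",
    Parameters=[
        {
            "ParameterName": "aurora_load_from_s3_role",
            "ParameterValue": "arn:aws:iam::123456789012:role/aurora-s3-load-role",
            "ApplyMethod": "immediate",
        }
    ],
)

The database user also still needs the engine-level privilege mentioned in the IAM/permissions point above (for example, GRANT LOAD FROM S3 on Aurora MySQL 2.x).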

The execution time should be significantly shorter than your current 7-hour ETL job, as LOAD DATA FROM S3 is optimized for bulk loading data directly into Aurora MySQL.
AWS EXPERT · answered 19 days ago
