Skip to content

AWS Glue notebook SQL command example in Pyspark

0

I need to execute two commands into an Aurora MySQL database, which has a connection in glue already in place. The first command is TRUNCATE TABLE and the second LOAD DATA FROM S3 into a table. I know I could easily do this with a Lambda function in Python but the timeout limit is not enough (15 min), since the data I need to load is an 11GB text file.

I couldn't find any examples on how to do this (run a SQL command using an existing connection). I've built a visual ETL job that works fine, it has a couple of transformations that I need to do, which were not needed if I could run the LOAD FROM S3 command. Also, I suspect it would take less time (currently 7 hours).

Any help appreciated.

asked 2 years ago305 views
1 Answer
0
Accepted Answer

For this scenario, I can suggest a few approaches using AWS Glue, focusing on executing SQL commands directly:

  1. Using AWS Glue PySpark job:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
import pymysql

def execute_sql_commands():
    # Initialize Glue context
    glueContext = GlueContext(SparkContext.getOrCreate())
    
    # Get connection information from Glue Connection
    connection = glueContext.extract_jdbc_conf('your-connection-name')
    
    # Connect to Aurora MySQL
    conn = pymysql.connect(
        host=connection['url'].split('/')[-1],
        user=connection['user'],
        password=connection['password'],
        db=connection['databaseName']
    )
    
    try:
        with conn.cursor() as cursor:
            # Execute TRUNCATE
            cursor.execute("TRUNCATE TABLE your_table")
            
            # Execute LOAD DATA FROM S3
            load_command = """
                LOAD DATA FROM S3 's3://your-bucket/your-file.txt'
                INTO TABLE your_table
                FIELDS TERMINATED BY ','
                LINES TERMINATED BY '\\n'
                (column1, column2, ...)
            """
            cursor.execute(load_command)
            
        conn.commit()
    finally:
        conn.close()

# Call the function
execute_sql_commands()
  1. Alternative approach using AWS Glue's built-in connection handler:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

def execute_sql_commands():
    glueContext = GlueContext(SparkContext.getOrCreate())
    
    # Create dynamic frame connection options
    connection_options = {
        "url": "jdbc:mysql://your-aurora-endpoint:3306/database",
        "dbtable": "your_table",
        "user": "username",
        "password": "password",
        "connectionName": "your-connection-name"
    }
    
    # Execute the SQL commands using Glue's JDBC connection
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=DynamicFrame.fromDF(spark.createDataFrame([], schema), glueContext, "empty"),
        catalog_connection="your-connection-name",
        connection_options={
            "preactions": "TRUNCATE TABLE your_table;",
            "postactions": """
                LOAD DATA FROM S3 's3://your-bucket/your-file.txt'
                INTO TABLE your_table
                FIELDS TERMINATED BY ','
                LINES TERMINATED BY '\\n'
                (column1, column2, ...)
            """
        }
    )

Key considerations:

1. Set appropriate timeout: AWS Glue jobs can run up to 24 hours, so you won't face the Lambda timeout limitation.

2. IAM roles: Ensure your Glue role has:
   - Access to the Aurora cluster
   - Access to the S3 bucket
   - Permissions to execute LOAD DATA FROM S3

3. Network configuration: Make sure your Glue job runs in a VPC that can access your Aurora cluster.

4. Monitor progress: You can monitor the job progress in CloudWatch logs.

For better performance, you might want to consider:

- Using LOAD DATA CONCURRENT to parallelize the load operation
- Adjusting Aurora's innodb_buffer_pool_size
- Setting appropriate values for local_infile and aurora_load_from_s3_role

The execution time should be significantly faster than your current 7-hour ETL job, as LOAD DATA FROM S3 is optimized for bulk loading data directly into Aurora MySQL.
AWS
EXPERT
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.