How to speed up my Glue ETL process?

I am still learning to use the Glue ETL process for building new aggregate tables and need help optimizing my ETL job.

My ETL job is designed to run once per day in the mornings and pull in all the active users from the previous day along with a number of their aggregated stats. To do this, I have set up an ETL job in Glue. I selected Visual ETL and am using a SQL transformation that pulls data from various tables in my Data Catalog. It then exports the output into a new S3 bucket and creates a new table in the Data Catalog. It is set to "Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions".

To test my query output I have just been running the query in Athena. The query takes about 8 minutes to run in Athena and produces about 15k rows for one day. However, when using that same query in the SQL transformation for the ETL, the job takes almost 4 hours to run with 10 DPUs!

Is there anything I might not know about that could help speed up my ETL job? Thank you!

Jjo
asked 8 months ago · 324 views
1 Answer

The visual job reads the source with DynamicFrame, doesn't apply pushdown filters automatically, and will probably read the files to infer the schema when converting to a DataFrame so the SQL can run. In your case, if you write a script job that runs the query directly with spark.sql(), you will get something closer (it won't be as fast as Athena, especially with that modest capacity).
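
A minimal sketch of that script-job approach, assuming the job is configured to use the Glue Data Catalog as its Hive metastore (the --enable-glue-datacatalog option) so tables can be queried by name; the query body, table names, and output path below are placeholders:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue script-job boilerplate.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Run the aggregate query in one statement; Spark plans it as a whole and
    # prunes partitioned tables from the WHERE clause, with no DynamicFrame
    # conversion or extra schema-inference pass.
    result = spark.sql("""
        SELECT ...  -- your aggregate query, using database.table names
        FROM database.event_ad_view
        WHERE to_date(dt) >= date_sub(current_date(), 2)
    """)

    # Write straight to S3 (placeholder path and partition column).
    result.write.mode("append").partitionBy("dt").parquet(
        "s3://your-output-bucket/user_daily/"
    )

    job.commit()

Note that a plain DataFrame write like this only lands files in S3; new partitions still have to be registered in the Data Catalog separately (for example with a crawler, or by writing through a Glue sink with enableUpdateCatalog).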

AWS
EXPERT
answered 8 months ago
  • Forgive me if I am totally wrong here, but I think my job is already doing this. The Visual tool generates an automatic script, and I'm seeing nodes with the pushdown predicate, such as:

    event_ad_view_node1693352511778 = glueContext.create_dynamic_frame.from_catalog(
        database="database",
        push_down_predicate="to_date(dt) >= date_sub(current_date, 2)",
        table_name="event_ad_view",
        transformation_ctx="event_ad_view_node1693352511778",
    )
    

    Does this not imply that it is in fact filtering the dataset using the pushdown predicate as it is creating the dynamic frame?

    It's important to note that not all of the tables I am using are partitioned, so I am unable to apply a pushdown predicate to all of them. I have yet to find another way to filter on these tables.

    Later in the script it uses sparkSqlQuery to run the query:

    SqlQuery0 = """ my query
    
    UserDailyQuery_node1693352569830 = sparkSqlQuery(
        glueContext,
        query=SqlQuery0,
        mapping={
    ......
    }
    

    sparkSqlQuery is defined as follows:

    def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
        # Register each input DynamicFrame as a temp view, run the SQL against
        # the views, then wrap the resulting DataFrame back into a DynamicFrame.
        for alias, frame in mapping.items():
            frame.toDF().createOrReplaceTempView(alias)
        result = spark.sql(query)
        return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
    

    Is this the method you were referring to or is there another way?

  • In this case you will have more control if you use a script job, and the plan in the Spark UI will be simpler to understand, so you can see where the time is spent (in Glue, the Spark UI is enabled per job with the --enable-spark-ui parameter plus an S3 path for the event logs).
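
    For example, with the unpartitioned tables a script job lets you filter right after reading, before the temp views are registered. A hypothetical sketch (table and column names are made up; glueContext comes from the usual job boilerplate):

    from pyspark.sql import functions as F

    # Hypothetical unpartitioned table; no pushdown predicate is possible here.
    sessions_dyf = glueContext.create_dynamic_frame.from_catalog(
        database="database",
        table_name="sessions",
    )

    # Spark still scans the files, but filtering this early means the joins and
    # aggregations downstream only see two days of data instead of full history.
    sessions_df = sessions_dyf.toDF().where(
        F.to_date("event_ts") >= F.date_sub(F.current_date(), 2)
    )
    sessions_df.createOrReplaceTempView("sessions")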
