1 Answer
The visual job is going to read the source with DynamicFrame, which does not apply pushdown filters automatically and will probably read the files to determine the schema when converting to a DataFrame so you can run the SQL. In your case, if you write a script job that runs the query directly with spark.sql(), you will get something closer (it still won't be as fast as Athena, especially with that modest capacity).
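Something like this, as a minimal sketch, assuming the job is configured to use the Glue Data Catalog as its metastore (the database, table, and column names below are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Querying the catalog table directly lets Spark prune partitions and
# push filters down, instead of materializing a DynamicFrame first.
df = spark.sql("""
    SELECT id, status, year, month
    FROM my_database.my_table          -- placeholder names
    WHERE year = '2024' AND month = '01'
""")
df.show()

job.commit()
```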
Forgive me if I am totally wrong here, but I think my job is already doing this. The Visual tool generates a script automatically, and I'm seeing nodes in it that use a pushdown predicate, such as:
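Roughly like this (the database, table, and partition values here are placeholders for my real ones):

```python
# Generated for a Data Catalog source node; push_down_predicate
# prunes partitions before any data is read.
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",    # placeholder
    table_name="my_table",     # placeholder
    push_down_predicate="year == '2024' and month == '01'",
    transformation_ctx="S3bucket_node1",
)
```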
Does this not imply that it is in fact filtering the dataset using the pushdown predicate as it is creating the dynamic frame?
It's important to note that not all of the tables I am using are partitioned, so I am unable to apply a pushdown predicate to all of them. I have yet to find another way to filter these tables.
Later in the script, it uses sparkSqlQuery to run the query:
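Along these lines (the query text and node names are placeholders):

```python
# Exposes the incoming DynamicFrame under the alias "myDataSource"
# and runs the SQL against it.
SQLQuery_node2 = sparkSqlQuery(
    glueContext,
    query="SELECT * FROM myDataSource WHERE status = 'ACTIVE'",  # placeholder query
    mapping={"myDataSource": S3bucket_node1},
    transformation_ctx="SQLQuery_node2",
)
```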
sparkSqlQuery is defined as the following:
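Roughly this helper (a sketch of what Glue Studio typically emits):

```python
from awsglue.dynamicframe import DynamicFrame


def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    # Register each input DynamicFrame as a temp view so the SQL can
    # reference it by alias, then wrap the result back into a DynamicFrame.
    # `spark` is the SparkSession created earlier in the generated script.
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
```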
Is this the method you were referring to or is there another way?
In this case, you are going to have more control if you use a script job, and the query plan in the Spark UI will be simpler to understand, so you can see where the time is spent.