AWS Glue & PySpark: How to improve performance on a medium-to-large table


Hello team,

So, I built an ETL in Python using PySpark. I have a MySQL database on a bastion EC2 instance that is a copy of a production environment. Every day it copies production at around 2 o'clock, and then my ETL process starts: I copy the data from the MySQL EC2 instance using this kind of function:
try:
    tmp_data_frame = spark.read.format("jdbc") \
        .option("url", jdbc_url) \
        .option("user", username) \
        .option("password", password) \
        .option("query", query) \
        .load()
    dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, self.glue_context)
    self._print_data_details(dynamic_frame)
    return dynamic_frame

I use a query because I have cases where I need to do a join between two big tables in the source to get the data that needs to be extracted (e.g. table 1 has the id and a last-update date, table 2 has the relevant data and the id). Or sometimes I want to use a query with a WHERE clause in it to reduce the size of the data to extract.
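For example, something along these lines (the table and column names here are just illustrative, not my real schema):

# Illustrative pushdown query -- the real one joins two large production tables.
query = """
    SELECT t2.*
    FROM table1 AS t1
    JOIN table2 AS t2 ON t2.id = t1.id
    WHERE t1.lastupdate >= '2024-01-01'
"""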

The thing is, I find the performance very disappointing. I get a 30-minute extract time for a query returning 250k rows, and the source table has an index. Even a full extract of a 4.5-million-row table can take up to 41 minutes using self.glue_context.create_dynamic_frame.from_catalog, from the extract to the load into a staging table, without any transformation.

Are there any good practices you would want to share that could help with performance?

Ted
asked 2 months ago · 168 views
1 Answer
Accepted Answer

Here are some tips that might help you enhance the efficiency of your ETL jobs:

  • Partition your data based on relevant keys that are often queried or filtered on.
  • Bucketing can also be useful for distributing data across multiple files in a more organized manner, especially for joins on large tables.
  • Adjust Spark configurations to match your workload's memory and CPU needs, for example the number of workers/DPUs, executor memory, and the number of shuffle partitions.
  • Use efficient columnar storage formats like Parquet or ORC. These formats are highly optimized for read performance and compression.
  • When using JDBC to fetch data, increase the fetch size to reduce the number of round trips to the database (see the sketch after this list).
  • Ensure that data is evenly distributed across partitions to avoid data skew, which can lead to performance bottlenecks.
  • Cache intermediate datasets that are reused during the computation.
  • Depending on your workload, consider scaling instances.
  • Ensure that your database is optimized for read performance.
  • If possible, split your data extraction process into multiple parallel reads (see the sketch after this list).
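
To make the fetch-size and parallel-read points concrete, here is a minimal sketch of a partitioned JDBC read in PySpark. The table name, columns, bounds and partition count are placeholders you would adapt to your own schema; the lower/upper bounds can be obtained with a small preliminary MIN/MAX query:

from awsglue.dynamicframe import DynamicFrame

def read_parallel(spark, glue_context, jdbc_url, username, password):
    # Spark does not allow combining the "query" option with "partitionColumn",
    # so push the join/filter down as an aliased subquery in "dbtable" instead.
    pushdown = "(SELECT id, payload FROM source_table WHERE lastupdate >= '2024-01-01') AS src"

    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("user", username)
          .option("password", password)
          .option("dbtable", pushdown)
          .option("fetchsize", 10000)        # more rows per round trip to MySQL
          .option("partitionColumn", "id")   # numeric or date/timestamp column
          .option("lowerBound", 1)           # approximate MIN(id)
          .option("upperBound", 4500000)     # approximate MAX(id)
          .option("numPartitions", 8)        # 8 concurrent connections / read tasks
          .load())
    return DynamicFrame.fromDF(df, glue_context, "src")

Depending on your MySQL JDBC driver, you may also need useCursorFetch=true in the JDBC URL for the fetch size to take effect. If you read through the Glue Data Catalog instead, create_dynamic_frame.from_catalog accepts additional_options such as "hashfield" and "hashpartitions" that achieve a similar parallel JDBC read.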

If this has answered your question or was helpful, accepting the answer would be greatly appreciated. Thank you!

EXPERT
answered 2 months ago
AWS
SUPPORT ENGINEER
reviewed a month ago
  • Thank you for your answer, it is helpful indeed. I am already doing some of your points. Could you elaborate on the Spark configuration? Also, I am having trouble working out how many rows per partition I should have. I'm building a dynamic system, meaning I have a function that counts the number of source rows and creates the partitions accordingly. Should the partitions have 1 million rows each for a huge table, for example?
