
AWS Glue is taking a very long time to read data from a MySQL table (61 million records)


I have a Glue job reading a table from MySQL (61 million records). The job takes ages when performing a join with a mapping file (400 records). I tried a parallel JDBC read (both from the catalog and from options), but it did not improve performance.

Code:

```python
connection_option = {
    "url": "jdbc:mysql://<connection string>",
    "user": "<user>",
    "password": "<pass>",
    "dbtable": "dbname",
    "hashexpression": "column name",
    "hashpartitions": "10",
}
dy_trades = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_option,
    transformation_ctx="dy_trades",
)
```

Even if I don't perform any transformations and just try to write the data to an S3 file (connection with bulkSize 10), the job takes a very long time. I am running the job with 100 workers on Glue 3.0 (Spark).

Can you please guide me on performance improvement? Am I missing anything in the job?

1 Answer

Hello,

Regarding the join taking ages: when one of the tables in a join is small (here, the one with 400 records), you can instruct Spark to handle it differently and avoid the overhead of shuffling data. This is done by hinting to Apache Spark that the smaller table should be broadcast to all worker nodes instead of being partitioned and shuffled across the network. The Spark parameter spark.sql.autoBroadcastJoinThreshold configures the maximum size, in bytes, of a table that will be automatically broadcast to all worker nodes when performing a join.

Also, I can see you are using the G.1X worker type. Since G.1X is not the most suitable worker type for memory-intensive jobs, I would recommend upgrading to G.2X workers. For more details on AWS Glue worker types, see the documentation on AWS Glue jobs.


References:

  1. Performance Tuning - Spark 3.3.0 Documentation
  2. Optimize memory management in AWS Glue | AWS Big Data Blog
  3. Best practices to scale Apache Spark jobs and partition data with AWS Glue | AWS Big Data Blog
SUPPORT ENGINEER
answered 14 days ago
  • I have gone through the provided link. I imported "from pyspark.sql.functions import broadcast" and joined as below (both 'df_tradeType' and 'df_analyticsExtra' are DataFrames):

    ```python
    df_withMeta = (
        df_tradeType.join(df_analyticsExtra, df_tradeType.report_type == df_analyticsExtra.report_type)
        .drop(df_analyticsExtra.payment_rating)
        .drop(df_analyticsExtra.account_status)
        .drop(df_analyticsExtra.report_type)
    )
    ```

    However, when converting the DataFrame 'df_withMeta' to a DynamicFrame, it takes a long time. I am converting it in order to load it into a Redshift table with "glueContext.write_dynamic_frame.from_catalog( frame=df_withMeta,..... ". Is there any other way I can load the data into Redshift without converting the DataFrame?
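    For reference, the usual way to convert a Spark DataFrame to a Glue DynamicFrame is DynamicFrame.fromDF. A minimal sketch, assuming a Glue job context where `glueContext` and `df_withMeta` already exist; the database, table, and S3 temp path below are hypothetical placeholders:

    ```python
    from awsglue.dynamicframe import DynamicFrame

    # Wrap the Spark DataFrame as a DynamicFrame; the wrapping itself is lazy,
    # so the runtime cost is actually incurred at write time
    dyf_withMeta = DynamicFrame.fromDF(df_withMeta, glueContext, "dyf_withMeta")

    glueContext.write_dynamic_frame.from_catalog(
        frame=dyf_withMeta,
        database="redshift_db",      # hypothetical catalog database
        table_name="target_table",   # hypothetical catalog table
        redshift_tmp_dir="s3://my-temp-bucket/tmp/",  # hypothetical temp path
    )
    ```

    If the conversion step itself appears slow, the time is usually dominated by the upstream join being evaluated at that point rather than by fromDF.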
