AWS Glue is taking a very long time to read data from a MySQL table (61 million records)


I have a Glue job that reads a table from MySQL (61 million records). The job takes ages when performing a join with a mapping file (400 records). I tried a parallel JDBC read (both from the catalog and from options), but had no luck with performance.

Code:

    connection_option = {
        "url": "jdbc:mysql://<connection string>",
        "user": "<user>",
        "password": "<pass>",
        "dbtable": "dbname",
        "hashexpression": "column name",
        "hashpartitions": "10"
    }
    dy_trades = glueContext.create_dynamic_frame.from_options(
        "mysql",
        connection_options=connection_option,
        transformation_ctx="dy_trades"
    )
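The catalog-based variant of the parallel read I tried looks roughly like the sketch below (the database and table names are placeholders, not the actual job values):

    # Rough sketch of the same parallel JDBC read through the Data Catalog;
    # "my_db" and "my_table" are placeholders for the actual catalog entries.
    dy_trades_cat = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",
        table_name="my_table",
        additional_options={
            "hashexpression": "column name",  # numeric column used to split the read
            "hashpartitions": "10"            # number of parallel JDBC connections
        },
        transformation_ctx="dy_trades_cat"
    )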

Even if I don't perform any transformations and just try to load the data into an S3 file (connection with bulkSize 10), the job takes a very long time. I am running the job with 100 workers on Glue 3.0 (Spark).

Can you please guide me on performance improvement? Am I missing anything in the job?

asked 2 years ago · 1,996 views
1 Answer

Hello,

Regarding the join taking ages: when one of the tables in a join is small (here, the one with 400 records), you can tell Spark to handle it differently and reduce the overhead of shuffling data. This is done by hinting to Apache Spark that the smaller table should be broadcast instead of partitioned and shuffled across the network. The Spark parameter spark.sql.autoBroadcastJoinThreshold configures the maximum size, in bytes, of a table that will be broadcast to all worker nodes when performing a join.
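As a rough sketch, an explicit broadcast join looks like the following. It assumes the DynamicFrames have already been converted to Spark DataFrames; dy_mapping, df_trades, df_mapping and the join column are placeholder names, not taken from your job.

    from pyspark.sql.functions import broadcast

    # Convert the DynamicFrames to Spark DataFrames (placeholder names).
    df_trades = dy_trades.toDF()
    df_mapping = dy_mapping.toDF()

    # broadcast() marks the small 400-record table so Spark ships a full copy of it
    # to every executor and performs a map-side join, avoiding a shuffle of the
    # 61-million-row table.
    df_joined = df_trades.join(broadcast(df_mapping), on="report_type", how="inner")

    # Alternatively, raise the automatic broadcast threshold (in bytes) so Spark
    # broadcasts small tables on its own; -1 disables auto-broadcasting.
    # "spark" is the SparkSession (glueContext.spark_session).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))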

Also, I can see that you are using the G.1X worker type. Since G.1X workers are not necessarily the most optimal choice for memory-intensive jobs, I would recommend upgrading to the G.2X worker type. For more details on AWS Glue worker types, see the documentation on AWS Glue jobs.
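If you prefer to change the worker type programmatically rather than in the console, a boto3 sketch along these lines should work. The job name, role and script location below are placeholders; note that UpdateJob overwrites the previous job definition, so the JobUpdate has to restate the required fields.

    import boto3

    glue = boto3.client("glue")

    # Move the job to G.2X workers (8 vCPU / 32 GB memory per worker).
    # "my-glue-job", the role ARN and the script location are placeholders;
    # UpdateJob replaces the whole job definition, so required fields are repeated.
    glue.update_job(
        JobName="my-glue-job",
        JobUpdate={
            "Role": "arn:aws:iam::<account-id>:role/<glue-job-role>",
            "Command": {"Name": "glueetl", "ScriptLocation": "s3://<bucket>/<script>.py"},
            "GlueVersion": "3.0",
            "WorkerType": "G.2X",
            "NumberOfWorkers": 100
        }
    )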


References:

  1. Performance Tuning - Spark 3.3.0 Documentation (apache.org) [link]
  2. Optimize memory management in AWS Glue | AWS Big Data Blog (amazon.com) [link]
  3. Best practices to scale Apache Spark jobs and partition data with AWS Glue | AWS Big Data Blog (amazon.com) [link]
AWS
SUPPORT ENGINEER
Nitin_S
answered 2 years ago
  • I have gone through the provided link. I imported "from pyspark.sql.functions import broadcast" and joined as below (both 'df_tradeType' and 'df_analyticsExtra' are DataFrames):

        df_withMeta = (
            df_tradeType.join(df_analyticsExtra,
                              df_tradeType.report_type == df_analyticsExtra.report_type)
            .drop(df_analyticsExtra.payment_rating)
            .drop(df_analyticsExtra.account_status)
            .drop(df_analyticsExtra.report_type)
        )

    However, when I try to convert the DataFrame 'df_withMeta' to a DynamicFrame, it takes a long time. I am converting it in order to load the data into a Redshift table, as in "glueContext.write_dynamic_frame.from_catalog(frame=df_withMeta, ..... ". Is there any other way I can load the data into Redshift without converting the DataFrame?
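    The conversion and write step I'm describing is roughly the following sketch (the catalog database, table name and temp directory are placeholders, not the real job settings):

        # Rough sketch of the DataFrame -> DynamicFrame conversion and Redshift write;
        # database, table and temp-dir values are placeholders.
        from awsglue.dynamicframe import DynamicFrame

        dyf_withMeta = DynamicFrame.fromDF(df_withMeta, glueContext, "dyf_withMeta")

        glueContext.write_dynamic_frame.from_catalog(
            frame=dyf_withMeta,
            database="my_catalog_db",
            table_name="my_redshift_table",
            redshift_tmp_dir="s3://my-temp-bucket/tmp/"
        )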
