Hello,
Regarding the job taking a long time on the join operation:
When one of the tables in the join is small (here, the one with about 400 records), you can tell Spark to handle it differently and reduce the overhead of shuffling data. This is done by hinting to Apache Spark that the smaller table should be broadcast to every worker instead of being partitioned and shuffled across the network. The Spark parameter spark.sql.autoBroadcastJoinThreshold configures the maximum size, in bytes, of a table that will be broadcast to all worker nodes when performing a join.
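For illustration, here is a minimal sketch of the broadcast hint using toy DataFrames; the column names and sample rows below are placeholders, not taken from your job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Optional: raise the size limit (in bytes) under which Spark broadcasts a
# table automatically; setting it to -1 disables automatic broadcasting.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Toy stand-ins for the large table and the ~400-row lookup table.
large_df = spark.createDataFrame(
    [(1, "A"), (2, "B"), (3, "A")], ["id", "report_type"]
)
small_df = spark.createDataFrame(
    [("A", "monthly"), ("B", "weekly")], ["report_type", "description"]
)

# The broadcast() hint ships the small table to every executor, so the large
# table is joined locally instead of being shuffled across the network.
joined = large_df.join(broadcast(small_df), on="report_type", how="inner")
joined.show()
```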
Also, I can see that you are using the G.1X worker type. Since G.1X workers are not necessarily the best choice for memory-intensive jobs, I would recommend upgrading to the G.2X worker type. For more details on AWS Glue worker types, see the documentation on AWS Glue Jobs.
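If it helps, below is one possible way to switch the worker type programmatically with boto3. This is only a sketch with a placeholder job name; the same change can be made in the Glue console or via the AWS CLI. Note that UpdateJob overwrites the job definition, so existing settings should be carried over.

```python
# Sketch only: switch an existing Glue job to G.2X workers via boto3.
# "my-glue-job" and the fallback values are placeholders.
import boto3

glue = boto3.client("glue")
job_name = "my-glue-job"  # hypothetical job name

# UpdateJob completely replaces the job definition, so carry over the
# existing Role, Command, and other settings returned by get_job
# (connections, security configuration, etc. should be carried over too).
current = glue.get_job(JobName=job_name)["Job"]

glue.update_job(
    JobName=job_name,
    JobUpdate={
        "Role": current["Role"],
        "Command": current["Command"],
        "GlueVersion": current.get("GlueVersion", "3.0"),
        "DefaultArguments": current.get("DefaultArguments", {}),
        "WorkerType": "G.2X",  # more memory per worker than G.1X
        "NumberOfWorkers": current.get("NumberOfWorkers", 10),
    },
)
```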
References:
I have gone through the provided link. I imported "from pyspark.sql.functions import broadcast" and joined as below (both 'df_tradeType' and 'df_analyticsExtra' are DataFrames): "df_withMeta = df_tradeType.join(df_analyticsExtra, df_tradeType.report_type == df_analyticsExtra.report_type).drop(df_analyticsExtra.payment_rating).drop(df_analyticsExtra.account_status).drop(df_analyticsExtra.report_type)"
However, converting the DataFrame 'df_withMeta' to a DynamicFrame is taking a long time. I am converting it in order to load it into a Redshift table with "glueContext.write_dynamic_frame.from_catalog( frame=df_withMeta,..... ". Is there any other way I can load the data into Redshift without converting the DataFrame?
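For context, this is roughly the conversion-and-write path being described, as a minimal sketch; the catalog database/table names and the S3 temp directory are placeholders, and df_withMeta is the joined DataFrame from the snippet above:

```python
# Minimal sketch of the DataFrame -> DynamicFrame -> Redshift path described
# above. Catalog names and the S3 temp dir are placeholders.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# df_withMeta is the joined DataFrame produced in the snippet above.
dyf_withMeta = DynamicFrame.fromDF(df_withMeta, glueContext, "dyf_withMeta")

glueContext.write_dynamic_frame.from_catalog(
    frame=dyf_withMeta,
    database="my_catalog_database",        # hypothetical Glue Data Catalog DB
    table_name="my_redshift_table",        # hypothetical catalog table
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",  # placeholder temp dir
)
```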