Hello,
Regarding the job taking a long time on the join operation:
When one of the tables in a join is small (here, the one with 400 records), you can hint to Spark that the smaller table should be broadcast to all worker nodes instead of partitioned and shuffled across the network, which avoids the overhead of shuffling the large table. The Spark parameter spark.sql.autoBroadcastJoinThreshold configures the maximum size, in bytes, of a table that will be automatically broadcast to all worker nodes when performing a join.
Also, I can see you are using the G.1X worker type. Since G.1X may not be the most suitable choice for memory-intensive jobs, I would recommend upgrading to G.2X. For more details on AWS Glue worker types, see the documentation on AWS Glue jobs.
I have gone through the provided link, imported `from pyspark.sql.functions import broadcast`, and joined as below (both `df_tradeType` and `df_analyticsExtra` are DataFrames):

```python
df_withMeta = (
    df_tradeType.join(
        df_analyticsExtra,
        df_tradeType.report_type == df_analyticsExtra.report_type,
    )
    .drop(df_analyticsExtra.payment_rating)
    .drop(df_analyticsExtra.account_status)
    .drop(df_analyticsExtra.report_type)
)
```

However, converting the DataFrame `df_withMeta` to a DynamicFrame is taking a long time. I am converting it in order to load it into a Redshift table via `glueContext.write_dynamic_frame.from_catalog(frame=df_withMeta, ...)`. Is there any other way I can load the data into Redshift without converting the DataFrame?