Regarding the job taking a long time on the join operation:
When one of the tables in a join is small (here, the one with 400 records), you can tell Spark to handle it differently and avoid the overhead of shuffling data across the network. This is done by hinting to Spark that the smaller table should be broadcast to every executor instead of being partitioned and shuffled. The Spark parameter
spark.sql.autoBroadcastJoinThreshold configures the maximum size, in bytes, of a table that Spark will automatically broadcast to all worker nodes when performing a join (the default is 10 MB; setting it to -1 disables automatic broadcast joins).
Also, I can see you are using the G.1X worker type. Since G.1X may not be optimal for memory-intensive jobs, I would recommend upgrading to G.2X, which provides twice the memory per worker. For more details on AWS Glue worker types, see the documentation on AWS Glue jobs.