Could you please confirm whether I understood the scenario correctly?
You have one AWS Glue job (Spark job) which:
- reads 2 tables from an RDS database
- joins the 2 tables
- writes to a third table in a different database
- you only want to process incremental data
- you have enabled Job Bookmarks for the job?
When AWS Glue writes to a JDBC database, it can only INSERT data; to capture only the new data on each run, you enable job bookmarks. If enabled, the job bookmark applies to every table read in the job, and for a JDBC connection it will by default use the primary key to check for new data, unless you specify a different key.
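For instance, here is a minimal PySpark sketch of such a bookmarked read, assuming a catalog-backed JDBC source (all database, table, and key names below are placeholders):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what ties this read to the job's bookmark state;
# jobBookmarkKeys overrides the default (primary key) used to detect new rows.
table_a = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db",                        # placeholder catalog database
    table_name="table_a",                        # placeholder table
    transformation_ctx="table_a_read",
    additional_options={
        "jobBookmarkKeys": ["last_updated_id"],  # placeholder key column
        "jobBookmarkKeysSortOrder": "asc",
    },
)

job.commit()  # persists the bookmark position after a successful run
```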
If the two tables have different update schedules at the source, the join may produce an empty dataset, and no additional data would be inserted into the target DB.
If you need to join the incremental data from table A with the whole of table B, you would need to split the job in two and create a workflow (see the sketch after these steps).
Job 1 reads the whole of table B (job bookmark disabled) and writes it to a temporary file on S3.
Job 2 reads table A with the bookmark enabled and the temporary dataset from S3, performs the join, and writes to the target DB.
Job 2 removes the temporary file if the write to the target DB is successful.
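A rough PySpark sketch of both jobs, shown in one listing for brevity (each would really be its own Glue script; all database, table, bucket, and key names are placeholders):

```python
import sys
import boto3
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# --- Job 1 (its own script): full read of table B, bookmark disabled ---
# Omitting transformation_ctx means this read is not tracked by the bookmark.
table_b = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="table_b"           # placeholder names
)
glue_context.write_dynamic_frame.from_options(
    frame=table_b,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/tmp/table_b/"},  # placeholder
    format="parquet",
)

# --- Job 2 (its own script): incremental table A + staged table B ---
table_a = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db",
    table_name="table_a",
    transformation_ctx="table_a_read",   # bookmark tracks this read
).toDF()
staged_b = glue_context.spark_session.read.parquet("s3://my-bucket/tmp/table_b/")

joined = table_a.join(staged_b, on="join_key")           # placeholder join key

joined.write.jdbc(
    url="jdbc:mysql://target-host:3306/target_db",       # placeholder target
    table="joined_table",
    mode="append",
    properties={"user": "admin", "password": "secret",
                "driver": "com.mysql.cj.jdbc.Driver"},
)

# Remove the temporary files only after the JDBC write has succeeded.
boto3.resource("s3").Bucket("my-bucket").objects.filter(
    Prefix="tmp/table_b/"
).delete()

job.commit()  # advances the bookmark only on success
```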
Please note that this would work if you always need to insert new data. If you need to UPDATE the target table, it would be more complex, and to provide some guidance I would need additional details.
Hope this helps,
As was mentioned, the DataFrame JDBC writer only supports INSERTs or overwriting the entire dataset.
To do an UPDATE would require some work. Basically, you would share the JDBC connection properties with the executors using a broadcast variable, take a DataFrame that contains only the records requiring an update, call foreachPartition on that DataFrame, open a JDBC connection inside each partition, and then loop through the partition's records, performing the UPDATE for each one.
You can find a Scala example here: https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
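If you are working in PySpark rather than Scala, a rough equivalent of the same pattern might look like the following (the connection details, table, and column names are all placeholders, and PyMySQL is just one possible client library for a MySQL target):

```python
import pymysql  # assumption: MySQL target; substitute your engine's client
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-update-sketch").getOrCreate()

# Share the connection properties with the executors via a broadcast variable.
# All values are placeholders; in practice, fetch credentials from AWS
# Secrets Manager rather than hard-coding them.
conn_props = {
    "host": "target-host",
    "user": "admin",
    "password": "secret",
    "database": "target_db",
}
bc_props = spark.sparkContext.broadcast(conn_props)

def update_partition(rows):
    """Open one connection per partition and UPDATE row by row."""
    conn = pymysql.connect(**bc_props.value)
    try:
        with conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    "UPDATE joined_table SET value = %s WHERE id = %s",
                    (row["value"], row["id"]),
                )
        conn.commit()
    finally:
        conn.close()

# Stand-in for the DataFrame holding only the records that require an UPDATE.
updates_df = spark.createDataFrame([(1, "new-value")], ["id", "value"])
updates_df.foreachPartition(update_partition)
```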