Update Postgres RDS table with an AWS Glue script

3

I would like to transfer data from Postgres RDS database tables to a new reporting database, also created as a Postgres RDS instance. I created data catalogs and wrote a script that joins two tables together and then saves the data to the reporting database. It works as intended only on the first run, when it saves the current data to the new database. I need to update the reporting database daily with records newly added to the source database, but the job does not save the new data. Is there any way to insert only new data into a database with AWS Glue?

Asked 2 years ago · 3849 views
2 Answers
0

Hi,

could you please confirm whether I have understood the scenario correctly?

You have one AWS Glue job (Spark job) which:

  1. reads 2 tables from an RDS database
  2. joins the 2 tables
  3. writes to a third table in a different database
  4. you only want to process incremental data
  5. you have enabled Job Bookmarks for the job?

When AWS Glue writes to a JDBC database it only INSERTs data. If you want to capture only the new data, you need to enable job bookmarks. When enabled, job bookmarks apply to every table read in the job, and for a JDBC connection they use the primary key by default to check for new data, unless you specify a different bookmark key.
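For reference, bookmarking a JDBC catalog source is driven by the `transformation_ctx` and, optionally, an explicit bookmark key. Below is a minimal PySpark sketch under assumed names (the `source_db` database, `orders` table, and `id` key column are placeholders, not from the question):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required so the bookmark state is tracked

# transformation_ctx is what the bookmark state is keyed on; without it the
# read is not bookmarked. jobBookmarkKeys overrides the default primary key.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",        # placeholder catalog database
    table_name="orders",         # placeholder catalog table
    transformation_ctx="orders_src",
    additional_options={
        "jobBookmarkKeys": ["id"],          # assumed monotonically increasing key
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# ... join / transform / write ...

job.commit()  # persists the bookmark so the next run only sees new rows
```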

If the two source tables have different update schedules, the join of the two incremental reads may produce an empty dataset, and no additional data would be inserted into the target DB.

If you need to join the incremental data from table A with the whole of table B, you would need to split the job into a workflow (a sketch of Job 2 follows the steps below):

  1. Job 1 reads the whole of table B (job bookmark disabled) and writes it to a temporary dataset on S3.

  2. Job 2 reads table A with the bookmark enabled and the temporary dataset from S3, performs the join, and writes to the target DB.

  3. Job 2 removes the temporary dataset if the write to the target DB is successful.
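A rough PySpark sketch of Job 2, assuming Job 1 has already dumped table B as Parquet to a placeholder path (`s3://my-bucket/tmp/table_b/`); the database, table, and join-key names are all hypothetical:

```python
import sys
import boto3
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Incremental read of table A (bookmarked via transformation_ctx).
table_a = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="table_a",
    transformation_ctx="table_a_src",
).toDF()

# Full snapshot of table B written by Job 1.
table_b = spark.read.parquet("s3://my-bucket/tmp/table_b/")

# Join the new rows of A against the whole of B (assumed key: customer_id).
joined = table_a.join(table_b, on="customer_id", how="inner")

# INSERT the result into the reporting table registered in the Data Catalog.
glue_context.write_dynamic_frame.from_catalog(
    frame=DynamicFrame.fromDF(joined, glue_context, "joined"),
    database="reporting_db",
    table_name="reporting_table",
    transformation_ctx="reporting_sink",
)

# Clean up the temporary snapshot only after the write succeeded.
boto3.resource("s3").Bucket("my-bucket").objects.filter(
    Prefix="tmp/table_b/"
).delete()

job.commit()  # persist the bookmark for table A
```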

Please note that this would work if you only ever need to insert new data. If you need to UPDATE the target table, it would be more complex, and to provide guidance I would need additional details.

Hope this helps,

AWS
Expert
Answered 2 years ago
  • I am using RDS databases added to a Data Catalog as the source and target. I already have job bookmarks enabled for the job, but it doesn't help.

  • @AWS-User-125985: Hi, what I was trying to explain is that the job bookmarks may actually be the reason why you are not seeing new data in the target table.

    Job bookmarks work at the job level, so if the job reads from 2 tables and these tables have different update schedules, there is no guarantee it will read matching records; the result of the join might be empty, hence there is no new data to write to the target.

    This is why I was telling you that you might need to split the job.

  • Thank you for your response. What I would like to achieve is to have a job that performs an UPDATE. Could you help me with that?

0

As was mentioned, the DataFrame/JDBC writer supports INSERTs or overwriting entire datasets.

Doing an UPDATE would require some work. Basically, you would share the JDBC connection properties with the executors using a broadcast variable, build a DataFrame that contains only the records requiring an update, and call foreachPartition on that DataFrame. Inside each partition you would open a JDBC connection and loop through the rows, issuing an UPDATE for each one.

You can find a Scala example here: https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
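For the same pattern in PySpark rather than Scala, here is a hedged sketch. The table, column, host, and credential values are placeholders, and psycopg2 would have to be made available to the Glue job (for example via --additional-python-modules):

```python
import psycopg2
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connection details shared with the executors via a broadcast variable.
# All values below are placeholders.
conn_props = spark.sparkContext.broadcast({
    "host": "reporting-db.xxxxxxxx.eu-west-1.rds.amazonaws.com",
    "dbname": "reporting",
    "user": "glue_user",
    "password": "********",
})

# DataFrame containing only the rows that need updating (key + new value).
updates_df = spark.createDataFrame(
    [(1, "new value"), (2, "another value")], ["id", "col_a"]
)

def update_partition(rows):
    """Open one connection per partition and issue an UPDATE per row."""
    props = conn_props.value
    conn = psycopg2.connect(**props)
    cur = conn.cursor()
    for row in rows:
        cur.execute(
            "UPDATE reporting_table SET col_a = %s WHERE id = %s",
            (row["col_a"], row["id"]),
        )
    conn.commit()
    cur.close()
    conn.close()

# Runs update_partition once per partition on the executors.
updates_df.foreachPartition(update_partition)
```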

AWS
Don_D
Answered 2 years ago
