Update postgres RDS table with AWS glue script


I would like to transfer data from Postgres RDS database tables to a new reporting database, also created as a Postgres RDS instance. I created Data Catalog tables and a script that joins two tables together and saves the result to the reporting database. It works as intended only on the first run, when it saves the current data to the new database. I need to update the reporting database daily with records newly added to the base database, but the job is not saving new data for me. Is there any way to insert only new data into a DB with AWS Glue?

2 Answers

Hi,

could you please confirm whether I have understood the scenario correctly?

You have one AWS Glue job (Spark job) which:

  1. reads 2 tables from an RDS database
  2. joins the 2 tables
  3. writes to a third table in a different database
  4. you only want to process incremental data
  5. you have enabled Job Bookmarks for the job?

When AWS Glue writes to a JDBC database it only INSERTs data; to capture just the new data, you need to enable job bookmarks. When enabled, job bookmarks apply to every table read in the job, and for a JDBC connection they use the primary key by default to check for new data, unless you specify a different bookmark key.
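As a minimal sketch of a bookmarked read in a Python Glue script (the catalog database, table name, and id bookmark key here are all hypothetical):

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # bookmarks are only tracked between init() and commit()

    # transformation_ctx is the name the bookmark state is stored under;
    # jobBookmarkKeys overrides the default primary-key column that Glue
    # uses to detect new rows on a JDBC source.
    table_a = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db",      # hypothetical catalog database
        table_name="table_a",          # hypothetical table
        transformation_ctx="table_a_read",
        additional_options={
            "jobBookmarkKeys": ["id"],          # hypothetical key column
            "jobBookmarkKeysSortOrder": "asc",
        },
    )

    # ... join and write steps go here ...

    job.commit()  # persists the bookmark so the next run reads only new rows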

If the two tables have different update schedules on the source, the join may produce an empty dataset, and no additional data would be inserted into the target DB.

If you need to join the incremental data from table A with the whole of table B, you would need to split the job in two and create a workflow, as sketched after the list below:

  1. Job 1 reads the whole of table B (job bookmarks disabled) and writes it to a temporary file on S3.

  2. Job 2 reads table A with bookmarks enabled plus the temporary dataset from S3, performs the join, and writes the result to the target DB.

  3. Job 2 removes the temporary file once the write to the target DB succeeds.
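A rough sketch of that two-job workflow, shown in one block for brevity and assuming the same glue_context/job boilerplate as above; the staging bucket, catalog names, join key, and target table are all hypothetical:

    import boto3
    from awsglue.dynamicframe import DynamicFrame

    STAGING_PATH = "s3://my-staging-bucket/table_b/"  # hypothetical staging location

    # --- Job 1 (bookmarks disabled): stage the full table B to S3 ---
    table_b = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db", table_name="table_b"  # no transformation_ctx,
    )                                                   # so no bookmark is kept
    glue_context.write_dynamic_frame.from_options(
        frame=table_b,
        connection_type="s3",
        connection_options={"path": STAGING_PATH},
        format="parquet",
    )

    # --- Job 2 (bookmarks enabled): incremental table A joined with staged table B ---
    table_a_inc = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db",
        table_name="table_a",
        transformation_ctx="table_a_read",  # bookmarked read, new rows only
    )
    table_b_full = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [STAGING_PATH]},
        format="parquet",
    )
    joined = table_a_inc.toDF().join(table_b_full.toDF(), "id")  # hypothetical join key

    glue_context.write_dynamic_frame.from_catalog(
        frame=DynamicFrame.fromDF(joined, glue_context, "joined"),
        database="my_catalog_db",
        table_name="reporting_table",       # hypothetical target table
    )

    # Only after the write succeeds, drop the staged copy of table B.
    boto3.resource("s3").Bucket("my-staging-bucket").objects.filter(
        Prefix="table_b/"
    ).delete()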

Please note that this works only if you always need to INSERT new data. If you need to UPDATE the target table it gets more complex, and I would need additional details to provide guidance.

Hope this helps,

AWS
EXPERT
answered 2 years ago
  • I am using RDS databases added to a Data Catalog as a source and target. I already have Job Bookmarks enabled for the job, but it doesn't help.

  • @AWS-User-125985: Hi, what I was trying to explain is that the job bookmarks may actually be the reason why you are not seeing new data in the target table.

    Job bookmarks work at the job level, so if the job reads from two tables that have different update schedules, there is no guarantee that the incremental reads contain matching records; the result of the join might then be empty, hence no new data to write to the target.

    This is why I was suggesting that you might need to split the job.

  • Thank you for your response. What I would like to achieve is a job that performs an UPDATE. Could you help me with that?


As was mentioned, writing through the DataFrame/JDBC interface supports only INSERTs or overwriting the entire dataset.

Doing an UPDATE requires some work. Basically, you share the JDBC connection properties with the executors using a broadcast variable, build a DataFrame containing only the records that require an update, and call foreachPartition on that DataFrame; inside each partition you open a JDBC connection and loop through the rows, performing the UPDATE for each one.
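The article linked below shows this in Scala; here is a rough PySpark equivalent, a sketch only: the reports table, the id/total columns, the connection details, and the availability of psycopg2 on the workers (e.g. via --additional-python-modules) are all assumptions.

    import psycopg2  # assumed to be installed on the Glue workers

    # Plain connection properties; small enough to broadcast to every executor.
    # In practice you would pull the password from Secrets Manager rather
    # than hard-coding it.
    jdbc_props = {
        "host": "my-reporting-db.example.rds.amazonaws.com",  # hypothetical endpoint
        "dbname": "reporting",
        "user": "glue_user",
        "password": "secret",
    }
    props_bc = glue_context.spark_session.sparkContext.broadcast(jdbc_props)

    def update_partition(rows):
        # One connection per partition, not per row.
        conn = psycopg2.connect(**props_bc.value)
        cur = conn.cursor()
        for row in rows:
            cur.execute(
                "UPDATE reports SET total = %s WHERE id = %s",  # hypothetical table/columns
                (row["total"], row["id"]),
            )
        conn.commit()
        cur.close()
        conn.close()

    # updates_df is a DataFrame holding only the records that need an UPDATE.
    updates_df.foreachPartition(update_partition)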

You can find a Scala example here: https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9

AWS
Don_D
answered 2 years ago
