Update Postgres RDS table with an AWS Glue script

3

I would like to transfer data from Postgres RDS database tables to a new reporting database, also created as a Postgres RDS instance. I created data catalogs and wrote a script that joins two tables together and then saves the data to the reporting database. It works as intended only on the first run, when it saves the current data to the new database. I need to update the reporting database daily with records newly added to the source database, but the job does not save the new data. Is there any way to insert only new data into a database with AWS Glue?

Asked 2 years ago · 3849 views
2 Answers
0

Hi,

could you please confirm whether I have understood the scenario correctly?

You have one AWS Glue job (Spark job) which:

  1. reads 2 tables from an RDS database
  2. joins the 2 tables
  3. writes to a third table in a different database
  4. you only want to process incremental data
  5. you have enabled Job Bookmarks for the job?

When AWS Glue writes to a JDBC database it only INSERTs data. If you want to capture only the new data, you need to enable job bookmarks. When enabled, job bookmarks apply to every table read in the job, and for a JDBC connection they use the primary key by default to check for new data, unless you specify a different bookmark key.
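For reference, bookmarking a JDBC catalog source is driven by the `transformation_ctx` and, optionally, an explicit bookmark key. Below is a minimal PySpark sketch under assumed names (the `source_db` database, `orders` table, and `id` key column are placeholders, not from the question):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required so the bookmark state is tracked

# transformation_ctx is what the bookmark state is keyed on; without it the
# read is not bookmarked. jobBookmarkKeys overrides the default primary key.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",        # placeholder catalog database
    table_name="orders",         # placeholder catalog table
    transformation_ctx="orders_src",
    additional_options={
        "jobBookmarkKeys": ["id"],          # assumed monotonically increasing key
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# ... join / transform / write ...

job.commit()  # persists the bookmark so the next run only sees new rows
```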

If the two source tables have different update schedules, the join of the two incremental reads may produce an empty dataset, and no additional data would be inserted into the target DB.

If you need to join the incremental data from table A with the whole of table B, you would need to split the job into a workflow (a sketch of Job 2 follows the steps below):

  1. Job 1 reads the whole of table B (job bookmark disabled) and writes it to a temporary dataset on S3.

  2. Job 2 reads table A with the bookmark enabled and the temporary dataset from S3, performs the join, and writes to the target DB.

  3. Job 2 removes the temporary dataset if the write to the target DB is successful.
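A rough PySpark sketch of Job 2, assuming Job 1 has already dumped table B as Parquet to a placeholder path (`s3://my-bucket/tmp/table_b/`); the database, table, and join-key names are all hypothetical:

```python
import sys
import boto3
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Incremental read of table A (bookmarked via transformation_ctx).
table_a = glue_context.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="table_a",
    transformation_ctx="table_a_src",
).toDF()

# Full snapshot of table B written by Job 1.
table_b = spark.read.parquet("s3://my-bucket/tmp/table_b/")

# Join the new rows of A against the whole of B (assumed key: customer_id).
joined = table_a.join(table_b, on="customer_id", how="inner")

# INSERT the result into the reporting table registered in the Data Catalog.
glue_context.write_dynamic_frame.from_catalog(
    frame=DynamicFrame.fromDF(joined, glue_context, "joined"),
    database="reporting_db",
    table_name="reporting_table",
    transformation_ctx="reporting_sink",
)

# Clean up the temporary snapshot only after the write succeeded.
boto3.resource("s3").Bucket("my-bucket").objects.filter(
    Prefix="tmp/table_b/"
).delete()

job.commit()  # persist the bookmark for table A
```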

Please note that this would work if you only ever need to insert new data. If you need to UPDATE the target table, it would be more complex, and to provide guidance I would need additional details.

Hope this helps,

AWS
Expert
Answered 2 years ago
  • I am using RDS databases added to a Data Catalog as the source and target. I already have job bookmarks enabled for the job, but it doesn't help.

  • @AWS-User-125985: Hi, what I was trying to explain is that the job bookmarks may actually be the reason why you are not seeing new data in the target table.

    Job bookmarks work at the job level, so if the job reads from 2 tables and these tables have different update schedules, there is no guarantee it will read matching records; the result of the join might be empty, hence there is no new data to write to the target.

    This is why I was telling you that you might need to split the job.

  • Thank you for your response. What I would like to achieve is to have a job that performs an UPDATE. Could you help me with that?

0

As was mentioned, the DataFrame/JDBC writer supports INSERTs or overwriting entire datasets.

Doing an UPDATE would require some work. Basically, you would share the JDBC connection properties with the executors using a broadcast variable, build a DataFrame that contains only the records requiring an update, and call foreachPartition on that DataFrame. Inside each partition you would open a JDBC connection and loop through the rows, issuing an UPDATE for each one.

You can find a Scala example here: https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
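For the same pattern in PySpark rather than Scala, here is a hedged sketch. The table, column, host, and credential values are placeholders, and psycopg2 would have to be made available to the Glue job (for example via --additional-python-modules):

```python
import psycopg2
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Connection details shared with the executors via a broadcast variable.
# All values below are placeholders.
conn_props = spark.sparkContext.broadcast({
    "host": "reporting-db.xxxxxxxx.eu-west-1.rds.amazonaws.com",
    "dbname": "reporting",
    "user": "glue_user",
    "password": "********",
})

# DataFrame containing only the rows that need updating (key + new value).
updates_df = spark.createDataFrame(
    [(1, "new value"), (2, "another value")], ["id", "col_a"]
)

def update_partition(rows):
    """Open one connection per partition and issue an UPDATE per row."""
    props = conn_props.value
    conn = psycopg2.connect(**props)
    cur = conn.cursor()
    for row in rows:
        cur.execute(
            "UPDATE reporting_table SET col_a = %s WHERE id = %s",
            (row["col_a"], row["id"]),
        )
    conn.commit()
    cur.close()
    conn.close()

# Runs update_partition once per partition on the executors.
updates_df.foreachPartition(update_partition)
```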

AWS
Don_D
Answered 2 years ago
