Update postgres RDS table with AWS glue script
I would like to transfer data from postgres RDS database tables to a new reporting database, also created as a postgres RDS. I created data catalogs and created a script that should join two tables together and then save data to a reporting database. It is working as intended only when run for the first time - it saves current data to new database. I need to update the reporting database daily with newly added records to a basic database, but it is not saving new data for me. Is there any way to insert only new data into a db with AWS Glue?
Hi,
could you please confirm if I understood well the scenario?
You have one AWS Glue job (Spark job) which:
- reads 2 tables from an RDS database
- joins the 2 tables
- writes to a third table in a different Database
- you only want to process incremental data
- you have enabled Job Bookmarks for the job?
AWS Glue when you write to a jdbc database only INSERT Data, if you want to capture the new data you are going to enable the job bookmarks. Job bookmark if enabled will apply to all table read in the job and if you are using a jdbc connection will use by default the primary key to check for new data, unless you specify a different key.
If the two tables have different update schedules on the source it may happen that the join results in an empty datasets and no additional data would be inserted in the target DB.
If you need to join the incremental data from Table A with the whole of table B, you would need to break the job and create a workflow.
Job 1 reads the full of table B (job bookmark disabled) and writes it to a temporary file on S3.
Job 2 reads the table A with bookmark enabled and the temporary data set from S3, it makes the join and writes to the target DB
job 2 removes the temporary file if the write to target DB is successful.
please notice that his would work if you need to always insert new data. if you need to UPDATE the target table , it would be more complex, and to provide some guidance I would need additional details.
hope this helps,
@AWS-User-125985: hi what I was trying to explain is that actually the Job Bookmarks may be the reason why you are not able to see new data on the target table.
Job bookmarks work at a job level, so if you have 2 tables in the job that you are reading from, and these tables have different update schedule, when the job read them it is not said it will read matching record and the result of the join might be empty, hence no new data to write to the target.
This is why I was telling you that you might need to split the job
Thank you for your response. What I would like to achieve is to have a job that performs UPDATE. Could you help me with that?
As was mentioned, the dataframe/JDBC supports INSERTs or overwriting entire datasets.
To do an UPDATE would require some work. Basically, you would share the JDBC connection properties with the executors using a broadcast
variable, then have a DataFrame that contains records requiring an update, perform a foreachPartition
call on that DataFrame where you would then make a JDBC connection and then loop through the partition with a foreach
call where you would perform the UPDATE.
You can find a Scala example here: https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
Relevant questions
Connect to RDS postgres database in GLUE script
asked 2 months agoFATAL: password authentication failed for user "postgres
asked 4 years agodoes AWS DMS support partition tables from postgres
asked 5 months agoLZ4 compression on RDS Postgres 14
Accepted Answerasked 4 months agoUpdate postgres RDS table with AWS glue script
asked 2 months agoIssues connecting to Postgres RDS instance from within a VPC
asked 3 years agoUse RDS Postgres Replicas as a cluster
Accepted Answerasked 4 months ago60K updates to RDS Postgres - performance challenge
Accepted Answerasked 2 years agoHow to create SSL dblink connection from RDS to postgres on premise?
asked 4 months agoHow to restrict database users for RDS Postgres using AWS Managed AD trusted with customer on-prem AD
Accepted Answerasked 2 years ago
I am using RDS databases added to a Data Catalog as a source and target. I already have Job Bookmarks enabled for the job, but it doesn't help.