AWS glue Upsert operation for few columns

0

We are doing AWS Glue POC with data transformation from one database to another database in redshift using JDBC connection. We are doing upsert operation in target database, as we are new to glue, we need some assistance on addressing the following issue:

We are doing transformation for table with some columns, in which we need to insert/update only limited number of columns, so we did transformation only for those columns. While running the job, it throwing an error expecting the other column from the table.

For example, we have 6 records(col-1, col-2, ...col-6) in our redshift target table and glue data catalog. In this, we need to do transformation only for 3 columns(col-2, col-3, col-4) and from the remaining columns, one is auto generated column(col-1) and others are updated by different sources(col-5, col-6). So we applied mapping only for 3 columns which we required by this job. But it is giving issue like it is expecting the other column(col-1) which we doesn't want to update through this source. Could you please help us to handle this scenario

NOTE: This is working fine in case of insert only operation.
asked 2 years ago3396 views
1 Answer
0

Hi,

I imagine that you are using Glue Studio and checking the box Upsert in the target node.

AWS Glue implement the Upsert following Redshift best practices, using a pre and post sql , the whole flow is:

  • pre-sql creates a Staging table
  • insert all the new records in a staging table
  • post SQL
    • uses the key defined to first delete all rows from the target table that do exists in the staging table
    • insert all rows from the staging table in the target table
    • drop the staging table

based on this you can understand that you do not have to worry about the columns you do not want to update, just keep them unchanged and they will not be modified during the process. This will not have any performance impact.

hope this helps and is clear

AWS
EXPERT
answered 2 years ago
  • Thanks for your response. Adding more information about the issue.

    While creating staging table, glue creating it by using target table structure. So the staging table structure is col-1, col-2, ...col-6. As I did transformation only for col-2, col-3, col-4, while inserting into staging table, it throwing error like below:

    Error (code 1207) while loading data into Redshift: "Invalid digit, Value 'D', Pos 0, Type: Integer"
    Table name: prod.stage_table_f88249242d264f7a98e0605187781706
    Column name: col1
    Column type: int4(0)
    Raw line: Distributor,@NULL@,@NULL@
    Raw field value: Distributor 
    

    From the exception, it is expecting col1 in the transformation. So it is stopping us in the ETL operation. Much appreciated if we get solution for this.

  • Hi ,

    I would like to know how to do upsert operation in aws glue for non redshift databases using jdbc connection.

  • @AWS-User-2414105 Really sorry for such a late reply, I might have missed the notification of your comment.

    What I meant is that you should add the other columns to the transformation but leave them unchanged, in this way they will also be replaced in the last step but with the current value.

    hope this is clear now

  • @AWS-User-9029218 you could implement it using an external library to run some pure SQL statement, see this answer https://repost.aws/questions/QUQ6gsQY2CQgWfvdbwI8HmVg 2) you can implement something similar to what described in this external blog post: https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions