ETL using AWS Glue


We are doing an ETL operation from Redshift to Redshift using AWS Glue and need to make it incremental. We need to identify the most recently modified records in the source and process only new and updated records, so we avoid reprocessing millions of rows.

We have following key columns in source:

  1. ID (primary key column)
  2. CreatedDate (timestamp at which the record was created)
  3. UpdatedDate (timestamp at which the record was last updated; initially null)

Is there any way to process only the latest records instead of reprocessing all of them?

Any possible solutions would be much appreciated.

4 Answers

One option for identifying incremental data is to create a view on top of the source table that selects rows whose create or update timestamp is greater than the last job run time. You can create a job metadata table to capture the last run time, or any other information you need, to drive incremental processing.
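A minimal sketch of that approach, building the incremental query and the metadata update as SQL strings. The `etl_job_metadata` table name and the function names are assumptions; only ID, CreatedDate, and UpdatedDate come from the question.

```python
# Sketch of the metadata-table approach. The etl_job_metadata table is
# hypothetical; strings are built with f-strings for illustration only
# (use parameterized queries in a real job).

def incremental_query(source_table: str, last_run_ts: str) -> str:
    """Select only rows created or updated since the last successful run.
    COALESCE handles the initially-NULL UpdatedDate by falling back to
    CreatedDate."""
    return (
        f"SELECT * FROM {source_table} "
        f"WHERE COALESCE(UpdatedDate, CreatedDate) > TIMESTAMP '{last_run_ts}'"
    )

def record_run_sql(job_name: str, run_ts: str) -> str:
    """Record this run's timestamp so the next run knows where to resume."""
    return (
        "INSERT INTO etl_job_metadata (job_name, last_run_ts) "
        f"VALUES ('{job_name}', TIMESTAMP '{run_ts}')"
    )
```

The first query can back the view (or be pushed down to Redshift from the Glue job); the second runs once at the end of a successful job.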

Since both the source and target clusters are Redshift, you may also want to explore Data Sharing, which is available on the RA3 instance type and allows the target cluster to read data directly from the source cluster. In that case, you do ELT in the source cluster and share live data with the target cluster. For additional information, see:

https://aws.amazon.com/redshift/features/data-sharing/
https://docs.aws.amazon.com/redshift/latest/dg/datashare-overview.html

answered 2 months ago

Unless I am missing something, I would not use AWS Glue for a Redshift-to-Redshift ETL process. Glue is great for file-based (likely Parquet) data in S3, but here I would write a Redshift stored procedure and use a Step Function to parameterize and schedule it. For more on incremental processing in Redshift, see this article: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
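The linked article describes a staging-table merge: load changed rows into a staging table, then delete-and-insert into the target inside one transaction. A sketch of the statements such a stored procedure would run, with hypothetical table names (`orders`, `orders_staging`):

```python
# Sketch of the Redshift staging-table merge pattern: delete matching rows
# from the target, insert everything from staging, all in one transaction.
# Table and key names here are illustrative placeholders.

def merge_statements(target: str, staging: str, key: str = "ID") -> list:
    """Return the ordered SQL statements for an upsert from staging into
    target, keyed on the primary key column."""
    return [
        "BEGIN",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key}",
        f"INSERT INTO {target} SELECT * FROM {staging}",
        f"TRUNCATE {staging}",
        "END",
    ]
```

Wrapping these in a stored procedure and invoking it from a scheduled Step Function keeps the whole pipeline inside Redshift.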

answered 2 months ago

AWS Glue jobs support bookmarks. Define a compound bookmark key from those key columns (I'd try just ID and UpdatedDate first) and use it as the job bookmark key in your ETL job.

EXPERT
answered 2 months ago

Is there a reason to ETL from one cluster to another?

If your clusters use RA3 nodes, data sharing between them removes the ETL requirement entirely: you get fast access to up-to-date, transactionally consistent data without needing to move it between clusters.

If additional processing is needed, you can use native stored procedures on top of data sharing to accomplish the task.

https://aws.amazon.com/redshift/features/data-sharing/

answered a month ago
