Redshift data warehouse and Glue ETL design recommendations


An AWS Partner is designing a data lake/warehouse system for a security-conscious customer. The partner has the following questions about designing an Amazon Redshift data warehouse and the related ETL pipelines, and I would like your help validating my understanding of the following:

1. In general, what is the recommended approach to do incremental loads/CDC from source systems to Amazon Redshift? 
2. Is it recommended to use AWS Glue to also export data from Amazon Redshift (hot data) to Amazon S3 (cold data)?
3. When do we need to run updates to the AWS Glue data catalogs? Is it only when the source table definition is changed? Any other scenarios?
4. Is there a scheduling / workflow mechanism to define loading dependencies into Amazon Redshift?
5. Is there a mechanism in Amazon Redshift to allow for retries if the data load fails?  E.g. dimension data needs to be loaded completely before fact data table loads starts? Is AWS Glue workflow the suggested tool here as well?
6. Can Tableau be used to query data from both Amazon Redshift and Amazon Redshift Spectrum at the same time?
7. Would there be any recommended sharable resource, such as a presentation with best practices and approach for designing a data warehouse in AWS?

Any insights on any of the above would be highly appreciated.

Thank you.

asked 2 years ago · 209 views
1 Answer
Accepted Answer

1. In general, what is the recommended approach to do incremental loads/CDC from source systems to Amazon Redshift?

It depends on the type of source system. For relational databases, AWS DMS can continuously replicate changed data into Redshift. For files on S3, AWS Glue jobs have a job bookmark feature that stores a marker of the most recently processed data, e.g. a timestamp or primary key value, so each run picks up only new data.
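A common way to apply CDC rows once they land is the staging-table upsert: load changed rows into a staging table, then delete the overlapping rows from the target and insert everything from staging inside one transaction. A minimal sketch of the pattern — the table and column names passed in the usage line are hypothetical:

```python
def build_upsert_sql(target: str, staging: str, key: str) -> str:
    """Build the classic Redshift upsert: delete rows from the target
    that also exist in the staging table, then insert all staged rows,
    all inside a single transaction so readers never see a partial load."""
    return (
        "BEGIN;\n"
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        f"TRUNCATE {staging};\n"
        "END;"
    )

# Hypothetical dimension table fed from a staging table:
print(build_upsert_sql("dim_customer", "stg_customer", "customer_id"))
```

The generated SQL can then be run by whatever executes your loads (a Glue job, a stored procedure, the Redshift Data API, etc.).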

2. Is it recommended to use AWS Glue to also export data from Redshift (hot data) to S3 (cold data)?

There are two main methods to export data from Redshift to S3: the UNLOAD command, or an INSERT INTO a Redshift Spectrum external table, which writes the result rows out to S3.
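For the UNLOAD route, a small helper that assembles the statement keeps the S3 path and IAM role out of the query text. A sketch — the bucket path and role ARN you pass in would be your own, and the caller is responsible for escaping quotes inside the query:

```python
def build_unload_sql(query: str, s3_path: str, iam_role: str) -> str:
    """Build an UNLOAD statement that exports a query's results to S3
    as Parquet. The query is embedded in single quotes, so any single
    quotes inside it must already be escaped by doubling them."""
    return (
        f"UNLOAD ('{query}') "
        f"TO '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )

# Hypothetical export of last year's rows to a cold-data prefix:
print(build_unload_sql(
    "SELECT * FROM sales WHERE sale_year = 2022",
    "s3://my-bucket/cold/sales/2022/",
    "arn:aws:iam::111122223333:role/redshift-unload-role",
))
```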

3. When do we need to run updates to the Glue data catalogs? Is it only when the source table definition is changed? Any other scenarios?

Glue Crawlers can be run to update metadata in the Glue Data Catalog when the structure of a table has changed, e.g. a column is added or dropped, and also when new partitions have been added to the table. The Glue API (e.g. UpdateTable or CreatePartition) can be used for the same purpose and doesn't incur the Crawler cost.

4. Is there a scheduling / workflow mechanism to define loading dependencies into Redshift?

This can be done in a number of ways: AWS Step Functions and Glue Workflows both support ordering steps by dependency. Redshift also now has a built-in query scheduler.
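With Step Functions, a dependency is just a `Next` pointer between Task states: the fact load only starts after the dimension load succeeds. A minimal Amazon States Language sketch, expressed as a Python dict; the state names and Glue job names are hypothetical:

```python
import json

# Two-step state machine: run the dimension-load Glue job to completion
# (.sync waits for the job), then run the fact-load job.
state_machine = {
    "StartAt": "LoadDimensions",
    "States": {
        "LoadDimensions": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "load-dimensions"},
            "Next": "LoadFacts",  # facts wait for dimensions
        },
        "LoadFacts": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "load-facts"},
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```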

5. Is there a mechanism in Redshift to allow for retries if the data load fails? E.g. dimension data needs to be loaded completely before fact table loads start? Is AWS Glue workflow the suggested tool here as well?

An orchestration tool can be used for this purpose, e.g. Step Functions (which supports per-task retry policies) or Glue Workflows. An alternative, if the transformations are built with stored procedures, is to orchestrate the individual loading procedures (for dimensions and facts) from a central loading procedure.
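In Step Functions the retry behaviour lives on the Task state itself, so a failed dimension load is retried with backoff before anything downstream runs. A sketch of one such state; the job name and tuning values are illustrative:

```python
# A Task state with its own retry policy: the execution only moves on to
# the fact load after the dimension-load job has succeeded, retrying the
# job on failure before giving up.
load_dimensions_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "load-dimensions"},
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 60,  # wait before the first retry
            "MaxAttempts": 3,       # then fail the whole execution
            "BackoffRate": 2.0,     # double the wait each attempt
        }
    ],
    "Next": "LoadFacts",
}
```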

6. Can Tableau be used to query data from both Redshift and Redshift Spectrum at the same time?

Yes. S3 objects are exposed through Spectrum as external tables in Redshift, and external tables can be queried just like normal internal tables. Views can be created that join or union internal tables with external tables, returning hot (internal) and cold (external) data through a single object that Tableau can query like any other table.
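Such a view might look like the SQL below (shown here as a Python string; the schema, table, and column names are hypothetical). Note that views which reference external tables must be created WITH NO SCHEMA BINDING:

```python
# Hypothetical view unioning recent rows held in Redshift with archived
# rows in S3 exposed through a Spectrum external schema. Views over
# external tables require the NO SCHEMA BINDING clause.
create_view_sql = """
CREATE VIEW analytics.sales_all AS
SELECT sale_id, sale_date, amount FROM analytics.sales_hot
UNION ALL
SELECT sale_id, sale_date, amount FROM spectrum.sales_cold
WITH NO SCHEMA BINDING;
"""
```

Tableau then connects to Redshift and queries `analytics.sales_all` without needing to know where each row physically lives.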

7. Would there be any recommended sharable resource, such as a presentation with best practices and approach for designing a data warehouse in AWS?

The Amazon Redshift Database Developer Guide has a best-practices section that is useful for low-level Redshift guidance.

Adam_G
answered 2 years ago
