ETL best practices: AWS Glue versus Redshift Scheduled queries

When I'm about to start an ETL job, I usually ask a couple of main questions:

  1. Where is the original file/table stored?
  2. What do I need to do to deliver the data to its end goal?

If all the data I need to build a gold table is already in silver tables, I prefer to run the ETL inside a scheduled query. However, scheduled queries seem harder to monitor than AWS Glue jobs, and they do not follow the same workflow I have already built in Glue.

So my question is: which is usually the best path in that situation? Is it normal to run scheduled queries, or is it better to maintain all ETL procedures inside AWS Glue, using its different features?

If possible, I would like to understand the pros and cons of each setup!

Julio
asked a year ago, 696 views
1 Answer
Accepted Answer

In most cases, if the data is already stored in Redshift and that is all the data the job needs, it is better to do the processing there rather than take the data out, process it, and put it back.
Even if you want to take load off Redshift, moving the data out only pays off when the amount of data moved is small relative to the computation done on it.

If you want visibility, or want to include the step in a workflow, you could have a Glue Python shell job issue the commands to Redshift (while it is waiting on Redshift, the Glue cost is very small compared with a cluster).
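As a rough illustration of that last suggestion, here is a minimal sketch of what such a Glue shell script might look like, assuming it talks to Redshift through the Redshift Data API in boto3 and authenticates with a database user on a provisioned cluster. The cluster identifier, database, user, and the stored procedure name `gold.refresh_daily_sales` are hypothetical placeholders, not something from the question.

```python
import time

# States in which the Redshift Data API considers a statement finished.
TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}


def run_statement(client, sql, cluster_id, database, db_user, poll_seconds=5.0):
    """Submit one SQL statement and wait until Redshift has finished it.

    `client` is a boto3 "redshift-data" client (or any object exposing the
    same two methods). The heavy work happens inside Redshift; this process
    only sleeps and polls.
    """
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    statement_id = submitted["Id"]

    while True:
        description = client.describe_statement(Id=statement_id)
        status = description["Status"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status != "FINISHED":
        # Raising makes the Glue job run fail, so the failure shows up in
        # the same monitoring and workflow as the other Glue jobs.
        error = description.get("Error", "no error message")
        raise RuntimeError(f"Statement {statement_id} ended as {status}: {error}")
    return statement_id


def main():
    import boto3  # imported here so the helper above has no hard dependency

    client = boto3.client("redshift-data")
    run_statement(
        client,
        sql="CALL gold.refresh_daily_sales();",  # hypothetical procedure
        cluster_id="my-cluster",                 # hypothetical identifier
        database="analytics",                    # hypothetical database
        db_user="etl_user",                      # hypothetical user
    )

# In the actual job script, call main() as the last line.
```

Because the script raises on `FAILED` or `ABORTED`, the job run is marked as failed and can be retried or alerted on like any other Glue job, while the silver-to-gold transformation itself still runs entirely inside Redshift.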

AWS EXPERT
answered a year ago
