ETL best practices: AWS Glue versus Redshift Scheduled queries


When I'm about to start an ETL job, I usually ask two main questions:

  1. Where is the original file/table stored?
  2. What do I need to do to deliver the data to my end goal?

If all the data I need to build a gold table is already in silver tables, I prefer to do the ETL inside a scheduled query. However, scheduled queries seem harder to monitor than AWS Glue, and they don't follow the same workflow that I have previously built with Glue jobs.

So my question is: which is usually the best path in that situation? Is it normal to run scheduled queries, or is it better to maintain all ETL procedures inside AWS Glue, using its different features?

If possible, I would like to understand the pros and cons of each setup!

Julio
asked 6 months ago, 371 views
1 Answer
Accepted answer

In most cases, if the data is already stored in Redshift and that is all the data the job needs, it's better to do the processing there rather than take the data out, process it, and put it back.
Even if you want to take load off Redshift, the ratio of data moved to computation done would have to be small for that to be worth it.

If you want visibility, or want to include it as part of a workflow, you could have a Glue Python Shell job issue commands to Redshift (while it is waiting, the Glue cost is very small compared with a cluster).
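
As a rough illustration of that last option, here is a minimal sketch of what such a Glue Python Shell script might look like, assuming the SQL is submitted through the Redshift Data API (the boto3 `redshift-data` client) and polled until it completes. The cluster identifier, database name, secret ARN, and the stored procedure name are all hypothetical placeholders, and the polling interval and timeout are arbitrary choices, not recommendations.

```python
import time


def run_sql_and_wait(client, sql, *, cluster_id, database, secret_arn,
                     poll_seconds=5.0, timeout_seconds=3600.0):
    """Submit one SQL statement via the Redshift Data API and wait for it.

    `client` is expected to behave like boto3.client("redshift-data").
    Returns the final describe_statement response on success and raises
    on failure, so the Glue job run itself is marked as failed.
    """
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
    )
    statement_id = response["Id"]
    deadline = time.monotonic() + timeout_seconds

    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            error = desc.get("Error", "no error message returned")
            raise RuntimeError(f"Statement {statement_id} {status}: {error}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Statement {statement_id} still {status} after "
                f"{timeout_seconds} seconds"
            )
        # The job is idle here; Redshift is doing the actual work.
        time.sleep(poll_seconds)


def main():
    # boto3 is imported here so the helper above can be tested without it.
    import boto3

    client = boto3.client("redshift-data")
    run_sql_and_wait(
        client,
        # Hypothetical stored procedure that rebuilds the gold table.
        "CALL gold.refresh_sales_summary();",
        cluster_id="my-redshift-cluster",                # hypothetical
        database="analytics",                            # hypothetical
        secret_arn="arn:aws:secretsmanager:REGION:ACCOUNT:secret:NAME",  # placeholder
    )
```

In a real job you would call `main()` at the end of the script. Because the function raises when the statement fails, the failure surfaces as a failed Glue job run, which is what gives you the monitoring and workflow integration while the transformation itself still runs inside Redshift.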

AWS
EXPERT
answered 6 months ago
