ETL best practices: AWS Glue versus Redshift Scheduled queries


When I'm about to start an ETL job, I usually ask two main questions:

  1. Where is the original file or table stored?
  2. What should I do to deliver the data to my end goal?

If all the data I need to build a gold table is already in silver tables, I prefer to run the ETL as a scheduled query. However, scheduled queries seem harder to monitor than AWS Glue jobs, and they do not follow the workflow I have already built in Glue.

So my question is: which is usually the better path in that situation? Is it normal to run scheduled queries, or is it better to keep all ETL procedures inside AWS Glue, using its different features?

If possible, I would like to understand the pros and cons of each setup.

Julio
asked 6 months ago · 370 views
1 Answer
Accepted Answer

In most cases, if the data is already stored in Redshift and that is all the data the job needs, it is better to do the processing there rather than take the data out, process it, and put it back.
Even if you want to offload work from Redshift, the ratio of data moved to computation done would have to be small for that to be worth it.

If you want visibility, or want to include the step in a workflow, you could have a Glue Python shell job issue commands to Redshift. While that job waits for Redshift, its cost is very small compared with a cluster.
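As a rough illustration of that last suggestion, here is a minimal sketch of what such a Glue Python shell script might look like, assuming the Redshift Data API is used through boto3 (`execute_statement` and `describe_statement` on the `redshift-data` client). The cluster name, database, user, and stored procedure below are hypothetical placeholders, and the polling interval and timeout are arbitrary choices, not recommendations.

```python
import time


def run_sql_and_wait(client, sql, cluster_id, database, db_user,
                     poll_seconds=5.0, timeout_seconds=3600.0):
    """Submit one SQL statement via the Redshift Data API and block until it ends.

    Returns the statement id on success. Raises RuntimeError if Redshift
    reports FAILED or ABORTED, so the calling Glue job is marked as failed.
    """
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    statement_id = response["Id"]
    deadline = time.monotonic() + timeout_seconds

    while True:
        status = client.describe_statement(Id=statement_id)
        state = status["Status"]
        if state == "FINISHED":
            return statement_id
        if state in ("FAILED", "ABORTED"):
            detail = status.get("Error", "no error message returned")
            raise RuntimeError(f"Statement {statement_id} {state}: {detail}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Statement {statement_id} still {state} after timeout")
        # The job only sleeps here; the heavy lifting happens inside Redshift.
        time.sleep(poll_seconds)


def main():
    # boto3 is imported lazily so the helper above can be tested without it.
    import boto3

    client = boto3.client("redshift-data")
    run_sql_and_wait(
        client,
        sql="CALL etl.build_gold_sales();",  # hypothetical stored procedure
        cluster_id="my-redshift-cluster",    # hypothetical cluster identifier
        database="analytics",                # hypothetical database name
        db_user="etl_user",                  # hypothetical database user
    )
```

In a Glue job the script would end with a call to `main()`. Because a failed statement raises an exception, the job run fails as well, which lets Glue workflows and job monitoring pick up the failure in the same way as for other Glue jobs.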

AWS
EXPERT
answered 6 months ago
