ETL best practices: AWS Glue versus Redshift Scheduled queries

When I'm about to start an ETL job, I usually ask two main questions:

  1. Where is the original file/table stored?
  2. What do I need to do to deliver the data to my end goal?

If I already have all the data I need in silver tables to build a gold table, I prefer to do my ETL inside a Redshift scheduled query. However, a scheduled query seems harder to monitor than an AWS Glue job, and it does not follow the same workflow that I have previously built with Glue jobs.

So my question is: which is usually the best path in that situation? Is it normal to run scheduled queries, or is it better to maintain all ETL procedures inside AWS Glue, using its different features?

If possible, I would like to understand the pros and cons of each setup.

Julio
asked 6 months ago · 370 views
1 Answer
Accepted Answer

In most cases, if the data is already stored in Redshift and that is all the data the job needs, it is better to do the processing there rather than take the data out, process it, and put it back.
Even if you want to take load off Redshift, the ratio of data moved to computing done would have to be small for that to be worth it.

If you want visibility, or want to include the step as part of a workflow, you could have a Glue Python shell job issue the commands to Redshift. While that job is waiting for Redshift to finish, the Glue cost is very small compared with that of a cluster.
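
As a rough illustration of a Glue shell job issuing commands to Redshift, here is a minimal sketch. It assumes the job talks to Redshift through the Redshift Data API (`execute_statement` / `describe_statement` in boto3), which is only one possible way to do it; the answer above does not say which mechanism to use. The cluster identifier, database, user, and the stored procedure name in the usage comment are hypothetical placeholders.

```python
import time


def run_redshift_sql(client, sql, *, cluster_id, database, db_user, poll_seconds=5.0):
    """Submit one SQL statement through the Redshift Data API and wait for it.

    `client` is expected to behave like boto3.client("redshift-data").
    Returns the final describe_statement response on success and raises
    RuntimeError if the statement fails or is aborted, so the Glue job
    itself is marked as failed and shows up in Glue monitoring.
    """
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    statement_id = submitted["Id"]

    # The Data API is asynchronous, so poll until a terminal status is reached.
    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(
                f"Statement {statement_id} {status}: {desc.get('Error')}"
            )
        time.sleep(poll_seconds)


# Hypothetical usage inside a Glue Python shell job (all names are placeholders):
#
#   import boto3
#   client = boto3.client("redshift-data")
#   run_redshift_sql(
#       client,
#       "CALL gold.refresh_sales_summary();",
#       cluster_id="my-cluster",
#       database="analytics",
#       db_user="etl_user",
#   )
```

Because the function raises on failure, a Glue workflow or trigger that chains this job with the existing Glue jobs would see the Redshift step succeed or fail like any other job, while the heavy lifting stays inside Redshift.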

AWS
Expert
answered 6 months ago
