ETL best practices: AWS Glue versus Redshift Scheduled queries


When I'm about to start an ETL job, I usually ask two main questions:

  1. Where is the original file/table stored?
  2. What do I need to do to deliver the data to my end goal?

If all the data I need to build a gold table is already in silver tables, I prefer to run the ETL as a scheduled query. However, scheduled queries seem harder to monitor than AWS Glue jobs, and they do not follow the same workflow I have already built in Glue.

So my question is: which is usually the best path in that situation? Is it normal to run scheduled queries, or is it better to keep all ETL procedures inside AWS Glue, using its different features?

If possible, I would like to understand the pros and cons of each setup!

Julio
asked 6 months ago, 370 views
1 Answer
Accepted Answer

In most cases, if the data is already stored in Redshift and that is all the data the job needs, it is better to do the processing there rather than take the data out, process it, and put it back.
Even if you want to offload work from Redshift, the ratio of data moved to computation performed would have to be small for that to be worth it.

If you want visibility, or want to include the step in a workflow, you could have a Glue Python shell job issue the commands to Redshift (while it waits for Redshift, the Glue cost is very small compared with a cluster).
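
As a rough illustration of that last suggestion, here is a minimal sketch of what such a Glue Python shell script could look like. It assumes the Redshift Data API is used through boto3's `redshift-data` client (`execute_statement` and `describe_statement`); the helper name `run_statement`, the cluster identifier, database, secret ARN, and the stored procedure `etl.build_gold_sales` are all hypothetical placeholders, not anything from the original question.

```python
import time


def run_statement(client, sql, *, cluster_id, database, secret_arn,
                  poll_seconds=5.0, timeout_seconds=3600.0):
    """Submit one SQL statement via the Redshift Data API and wait for it.

    `client` is expected to behave like boto3.client("redshift-data").
    Returns the final describe_statement response on success and raises
    if the statement fails, is aborted, or exceeds the timeout.
    """
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
    )
    statement_id = response["Id"]
    deadline = time.monotonic() + timeout_seconds

    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            # Raising makes the Glue job run fail, so the error shows up
            # in the job history and in any workflow that contains it.
            message = desc.get("Error", "no error message")
            raise RuntimeError(f"Statement {statement_id} {status}: {message}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Statement {statement_id} still {status}")
        # The job only sleeps here; Redshift does the actual processing.
        time.sleep(poll_seconds)


def main():
    # boto3 is imported here so the helper above stays testable without it.
    import boto3

    client = boto3.client("redshift-data")
    run_statement(
        client,
        "CALL etl.build_gold_sales();",  # hypothetical stored procedure
        cluster_id="my-cluster",         # hypothetical cluster identifier
        database="analytics",            # hypothetical database name
        secret_arn="arn:aws:secretsmanager:...",  # placeholder, left elided
    )
```

In a real job the script would end with a call to `main()`. Because the function raises on `FAILED` or `ABORTED`, the Glue job run is marked as failed, which is what gives the Redshift step the same monitoring and workflow behaviour as the other Glue jobs.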

AWS
Expert
answered 6 months ago
