Glue ETL Job with external connection to Redshift - filter then extract?


I have been trying to ETL a data set from Redshift with Glue for data lake consumption. The Redshift data set is very large, and I only want to extract the last x days of data on each job run. When I set up the job, the filter comes after the ApplyMapping step and before the ResolveChoice step. With that layout, the query that appears on the Redshift cluster is essentially a "SELECT *". It seems the job wants to load the entire Redshift table into Glue and then filter it, which is expensive and eventually fails. Is there a way to filter the data source before it is loaded into the dataframe?

1 Answer
Accepted Answer

I have a workaround for the pushdown predicate that uses the Databricks Redshift driver. It requires some custom coding in Glue, but it has worked flawlessly in the past. My sample code is at https://github.com/saunakc/etl-microservice-datalake/blob/master/src/glue/unload-table-part.py for reference.
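For readers who want the general shape of this workaround, here is a minimal sketch. It is not taken from the linked script and may differ from it. It assumes the spark-redshift connector (`com.databricks.spark.redshift`) is available to the Glue job and uses the connector's `query` option, so the date filter runs inside Redshift and only the matching rows are unloaded. The helper names, the table and column names, and the connection values are hypothetical placeholders.

```python
from datetime import date, timedelta


def build_incremental_query(table, date_column, days, today=None):
    """Return a SELECT that keeps only rows from the last `days` days.

    `table` and `date_column` are hypothetical names supplied by the caller.
    They are interpolated directly, so pass trusted values only.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= '{cutoff.isoformat()}'"
    )


def read_recent_rows(spark, jdbc_url, temp_dir, iam_role, query):
    """Load the query result through the spark-redshift connector.

    Because `query` is used instead of `dbtable`, Redshift evaluates the
    WHERE clause and only the filtered rows are unloaded to `temp_dir`.
    `spark` is an existing SparkSession (in Glue: glueContext.spark_session).
    """
    return (
        spark.read.format("com.databricks.spark.redshift")
        .option("url", jdbc_url)            # JDBC URL of the cluster
        .option("query", query)             # filter is pushed to Redshift
        .option("tempdir", temp_dir)        # S3 staging path for the UNLOAD
        .option("aws_iam_role", iam_role)   # role Redshift assumes to write to S3
        .load()
    )
```

Inside a job, something like `read_recent_rows(spark, jdbc_url, "s3://my-bucket/tmp/", role_arn, build_incremental_query("public.events", "event_date", 7))` would return a Spark DataFrame holding only the last 7 days, which can then be converted to a DynamicFrame if the rest of the job expects one.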

AWS
answered 5 years ago
