Glue ETL Job with external connection to Redshift - filter then extract?


I have been trying to ETL a data set from Redshift with Glue for data lake consumption. The Redshift data set is very large, and I only want to extract the last x days of data on each job run. When I set up the job, the filter comes after the ApplyMapping step and before the ResolveChoice step. With that arrangement, the query that appears on the Redshift cluster is essentially a "SELECT *". It seems Glue loads the entire Redshift table into the dataframe and only then filters it, which is expensive and eventually fails. Is there a way to filter the data source before it is loaded into the dataframe?

Asked 5 years ago · Viewed 475 times
1 Answer
Accepted Answer

I have a workaround for predicate pushdown that uses the Databricks Redshift driver. It requires some custom coding in Glue but has worked flawlessly in the past. My sample code, for reference: https://github.com/saunakc/etl-microservice-datalake/blob/master/src/glue/unload-table-part.py
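To illustrate the idea, here is a minimal sketch (not taken from the linked script) of reading from Redshift through the Databricks spark-redshift data source with the filter written into the SQL itself, so Redshift evaluates the WHERE clause before anything reaches Glue. The table name, date column, JDBC URL, S3 temp directory, and IAM role are all hypothetical placeholders, and the sketch assumes the spark-redshift jar and a Redshift JDBC driver are available to the Glue job.

```python
from datetime import date, timedelta


def build_recent_rows_query(table, date_column, days, today=None):
    """Build a SELECT that keeps only the last `days` days of `table`.

    `table` and `date_column` are hypothetical names; pass trusted values
    only, since they are interpolated directly into the SQL text.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    return f"SELECT * FROM {table} WHERE {date_column} >= '{cutoff.isoformat()}'"


def read_recent_rows(spark, jdbc_url, temp_dir, iam_role, query):
    """Load the result of `query` from Redshift as a Spark DataFrame.

    Assumes the Databricks spark-redshift connector is on the job's
    classpath. The `query` option makes Redshift run the filtered SELECT
    and unload only the matching rows to `temp_dir` on S3.
    """
    return (
        spark.read.format("com.databricks.spark.redshift")
        .option("url", jdbc_url)            # e.g. jdbc:redshift://host:5439/db?user=...&password=...
        .option("query", query)             # filtered SELECT instead of a whole table
        .option("tempdir", temp_dir)        # S3 staging location, e.g. s3://my-bucket/tmp/
        .option("aws_iam_role", iam_role)   # role Redshift assumes to write to S3
        .load()
    )
```

In a Glue job, something like `read_recent_rows(spark, url, tmp, role, build_recent_rows_query("public.events", "event_date", 7))` would return a regular Spark DataFrame, which can be wrapped with `DynamicFrame.fromDF` if the rest of the job expects Glue DynamicFrames.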

AWS
Answered 5 years ago
