Glue ETL Job with external connection to Redshift - filter then extract?

I have been attempting to ETL a data set from Redshift with Glue for data lake consumption. The Redshift data set is very large, and I only want to extract the last x days of data on each job run. When I set up the job, the Filter transform comes after ApplyMapping and before ResolveChoice. With that layout, the query that appears on the Redshift cluster is essentially a "SELECT *". It seems Glue loads the entire Redshift table into the dataframe and only then filters it, which is expensive and eventually fails. Is there a way to filter the data source before it is loaded into the dataframe?

1 Answer
Accepted Answer

I have a workaround for the pushdown predicate using the Databricks Redshift driver. This requires some custom coding in Glue, but it has worked flawlessly in the past. My sample code is at https://github.com/saunakc/etl-microservice-datalake/blob/master/src/glue/unload-table-part.py for reference.
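For readers who want the general shape of the idea without opening the link, here is a minimal sketch. It is not the code from the repository above. It assumes the Databricks spark-redshift driver is available to the Glue job and uses that driver's `query` option, so that the WHERE clause runs on the Redshift cluster instead of in Glue. The function names, the table `public.events`, the column `event_date`, the bucket, and the role ARN are all hypothetical placeholders.

```python
from typing import Dict


def build_redshift_read_options(
    jdbc_url: str,
    table: str,
    date_column: str,
    days: int,
    temp_dir: str,
    iam_role_arn: str,
) -> Dict[str, str]:
    """Build reader options that send a date-filtered query to Redshift.

    `table` and `date_column` are interpolated into the SQL text, so they
    must come from trusted job configuration, never from user input.
    """
    if days <= 0:
        raise ValueError("days must be a positive integer")

    # The filter is part of the SQL that Redshift executes, so only the
    # last `days` days of rows are unloaded to S3 and read by Spark.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= DATEADD(day, -{int(days)}, GETDATE())"
    )
    return {
        "url": jdbc_url,
        "query": query,                # used instead of the "dbtable" option
        "tempdir": temp_dir,           # S3 staging location for the unload
        "aws_iam_role": iam_role_arn,  # role Redshift assumes to write to S3
    }


def read_recent_rows(spark, options: Dict[str, str]):
    """Load the filtered result as a Spark DataFrame.

    `spark` is the job's SparkSession (in a Glue job, glueContext.spark_session).
    Assumes the spark-redshift driver jar is on the job's classpath.
    """
    return (
        spark.read.format("com.databricks.spark.redshift")
        .options(**options)
        .load()
    )


# Example values only; every identifier below is a placeholder.
example_options = build_redshift_read_options(
    jdbc_url="jdbc:redshift://example-cluster:5439/dev?user=USER&password=PASS",
    table="public.events",
    date_column="event_date",
    days=7,
    temp_dir="s3://example-bucket/glue-temp/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
```

Inside the job you would then call `read_recent_rows(glueContext.spark_session, example_options)` and continue with the mapping steps on the returned dataframe. With this approach the query seen on the cluster should be the filtered SELECT rather than a full-table read.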

AWS
answered 5 years ago
