Glue ETL Job with external connection to Redshift - filter then extract?

I have been trying to ETL a data set from Redshift with Glue for data lake consumption. The Redshift data set is very large, and I only want to extract the last x days of data on each job run. When I set up the job, the filter comes after the ApplyMapping step and before the ResolveChoice step. With that ordering, the query that appears on the Redshift cluster is essentially a "SELECT *". It seems Glue wants to load the entire Redshift table into the dataframe and then filter it, which is expensive and eventually fails. Is there a way to filter the data source before it is loaded into the dataframe?

1 Answer

Accepted Answer

I have a workaround for the pushdown predicate that uses the Databricks Redshift driver. It requires some custom coding in Glue, but it has worked flawlessly in the past. My sample code is at https://github.com/saunakc/etl-microservice-datalake/blob/master/src/glue/unload-table-part.py for reference.
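
To illustrate the general idea, here is a minimal sketch of how such a workaround might look. It is not the code from the linked script. It assumes the Databricks spark-redshift connector is available to the Glue job, and the helper names (`build_pushdown_query`, `read_recent_rows`) and all argument values are hypothetical. The point is that the date filter is written into the SQL passed through the connector's `query` option, so Redshift applies the filter before anything is unloaded.

```python
def build_pushdown_query(table, date_column, days):
    """Build a Redshift SELECT that keeps only the last `days` days of rows.

    `table` and `date_column` are interpolated as-is, so they should come
    from trusted job configuration, never from untrusted input.
    """
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= DATEADD(day, -{days}, CURRENT_DATE)"
    )


def read_recent_rows(spark, jdbc_url, temp_dir, iam_role, query):
    """Read the result of `query` through the spark-redshift connector.

    `spark` is an existing SparkSession (in Glue, glueContext.spark_session).
    Assumption: the connector jar is on the job's classpath and `iam_role`
    can write to `temp_dir`, the S3 staging location used for the unload.
    """
    return (
        spark.read.format("com.databricks.spark.redshift")
        .option("url", jdbc_url)          # JDBC URL of the Redshift cluster
        .option("query", query)           # filter runs inside Redshift
        .option("tempdir", temp_dir)      # S3 staging path for unloaded data
        .option("aws_iam_role", iam_role) # role Redshift assumes for S3 access
        .load()
    )


# Example with placeholder names: only the last 7 days are requested.
query = build_pushdown_query("public.events", "event_date", 7)
```

The resulting dataframe can then be converted to a DynamicFrame if the rest of the job relies on Glue transforms. The number of days could be supplied as a job parameter so that each run extracts a different window.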

AWS
answered 5 years ago
