Glue ETL Job with external connection to Redshift - filter then extract?


I have been attempting to ETL a data set from Redshift with Glue for data lake consumption. The Redshift data set is very large, and I only want to extract the last x days of data on each job run. When I set up the job, the filter comes after the ApplyMapping step and before the ResolveChoice step. With that layout, the Redshift query appears on the cluster as essentially a "SELECT *". It seems the dataframe loads the entire Redshift table into Glue and only then filters it, which is expensive and eventually fails. Is there a way to filter the data source before it is loaded into the dataframe?

1 Answer
Accepted Answer

I have a workaround for predicate pushdown using the Databricks Redshift driver. It requires some custom coding in Glue but has worked flawlessly in the past. For reference, my sample code is at https://github.com/saunakc/etl-microservice-datalake/blob/master/src/glue/unload-table-part.py.
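To give a rough idea of the approach, here is a minimal sketch. It is not the code from the linked script. It assumes the Databricks spark-redshift driver is available to the Glue job and relies on that driver's `query` option, so the WHERE clause runs on the Redshift cluster rather than in Glue. The helper names, the table and column names, and the URL, bucket and role values are all hypothetical placeholders.

```python
def build_redshift_read_options(jdbc_url, table, date_column, days, temp_dir, iam_role):
    """Build reader options so that Redshift applies the date filter itself.

    All argument values are placeholders supplied by the caller; the
    function name and parameters are illustrative, not part of any library.
    """
    days = int(days)  # forces a plain integer into the SQL text
    if days < 0:
        raise ValueError("days must be non-negative")

    # The filter is part of the query sent to Redshift, so only the last
    # `days` days of rows are unloaded instead of the whole table.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= DATEADD(day, -{days}, CURRENT_DATE)"
    )
    return {
        "url": jdbc_url,           # JDBC URL of the cluster
        "query": query,            # used instead of "dbtable"
        "tempdir": temp_dir,       # S3 staging location for the unload
        "aws_iam_role": iam_role,  # role Redshift assumes to write to S3
    }


def read_recent_rows(spark, options):
    """Load the filtered rows as a Spark DataFrame.

    `spark` is an existing SparkSession; in a Glue job this would
    typically be glueContext.spark_session (assumption).
    """
    return (
        spark.read.format("com.databricks.spark.redshift")
        .options(**options)
        .load()
    )
```

In the job you would call `build_redshift_read_options(...)` with your own values, pass the result to `read_recent_rows`, and then continue with the mapping steps on the returned dataframe. The query on the cluster should then show the WHERE clause instead of a bare "SELECT *".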

AWS
answered 5 years ago
