AWS Glue filter does not filter data

0

I am trying to create an ETL where I need to bring in data from redshift tables but the dataset is too large and I need to filter it before applying transformations on it . Glue filter node and the SQL query option does not filter data according to the requirement . The job keeps running for a long time and then fails , possibly due to the size of data . It seems that Glue is brining in all the data and then tries to apply the filter but before the filter is applied , the job fails . Is there a way to only bring in filtered data from redshift and then apply transformation on it ?

aneeq10
asked a year ago623 views
1 Answer
1

One of the ways to handle extracting large amounts of data or testing our ETL transformation/extracts is create a sample set of data that is small. Redshift supports views. Therefore, consider creating a view on the structure with a filter criteria that can bring let's say 100k records. Then use that view for the ETL source and add transformation to validate the ETL works. Then increase the size of the records in the view or have multiple views handling smaller sets of data to complete the work. This is one of the ways you can handle the large volumes of data. Use this link to create a view and it may require some elevated privileges, so work with your admin in this regards. https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_VIEW.html

Another option is to use custom JDBC connection to Redshift database and use SQL query to filter the data before bringing it into Glue. This may be slower but gives you more control.

Other complex and expensive option is to utilize EMR to run a Spark job that connects to Redshift, filters the data, and applies transformations on it. It may require setup and configuration.

AWS
answered a year ago
  • Thank you for the answer . I'm curious about the second option you suggested . Are you suggesting that I create a custom connection using scripting and then execute the query directly from there ?

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions