AWS Glue filter does not filter data


I am trying to create an ETL job that brings in data from Redshift tables, but the dataset is too large and I need to filter it before applying transformations. Neither the Glue Filter node nor the SQL query option filters the data as required. The job runs for a long time and then fails, possibly because of the size of the data. It seems that Glue brings in all the data and only then tries to apply the filter, but the job fails before the filter is applied. Is there a way to bring in only the filtered data from Redshift and then apply transformations to it?

aneeq10
Asked a year ago · 645 views
1 Answer

One way to handle extracting large amounts of data, or to test ETL transformations and extracts, is to create a small sample set of data. Redshift supports views, so consider creating a view on the table with a filter criterion that returns, say, 100k records. Use that view as the ETL source and add the transformations to validate that the ETL works. Then increase the number of records in the view, or use multiple views that each handle a smaller set of data, to complete the work. Creating a view may require elevated privileges, so work with your admin on this. See https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_VIEW.html
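As a rough sketch of the view approach, the helper below only builds the `CREATE OR REPLACE VIEW` statement as a string; you would run it in Redshift yourself (query editor or any SQL client). The view name, table name, filter column, and row limit are all hypothetical placeholders, not names from the question.

```python
def build_sample_view_sql(view_name, source_table, where_clause, limit):
    """Return a CREATE VIEW statement that exposes a filtered, capped
    subset of source_table.

    All arguments are interpolated directly into the SQL text, so pass
    only trusted, hard-coded values (this is not safe for user input).
    """
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT * FROM {source_table}\n"
        f"WHERE {where_clause}\n"
        f"LIMIT {int(limit)};"
    )


# Hypothetical names, for illustration only.
sample_view_sql = build_sample_view_sql(
    view_name="public.orders_sample",
    source_table="public.orders",
    where_clause="order_date >= '2023-01-01'",
    limit=100000,
)
print(sample_view_sql)
```

Once the ETL is validated against the small view, the same helper can be called with a larger limit or with several non-overlapping filters to produce multiple views.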

Another option is to use a custom JDBC connection to the Redshift database and pass a SQL query that filters the data in Redshift before it is brought into Glue. This may be slower but gives you more control.
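A minimal sketch of the JDBC idea, assuming you read through Spark's generic JDBC data source from a Glue script: the `query` option makes the database run the SELECT, so only matching rows are transferred. The URL, credentials, table, and column names are hypothetical, and the Redshift JDBC driver is assumed to be available to the job.

```python
def build_jdbc_options(url, user, password, query, fetch_size=10000):
    """Build options for Spark's JDBC reader.

    Uses "query" (not "dbtable") so the filtering SELECT is executed
    by the database rather than by Spark after a full table load.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "query": query,
        "fetchsize": str(fetch_size),  # rows fetched per round trip
    }


def read_filtered(spark, options):
    """Load the query result as a Spark DataFrame.

    `spark` is an existing SparkSession; in a Glue job this is
    typically glueContext.spark_session.
    """
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical query and connection details, for illustration only.
FILTER_QUERY = (
    "SELECT order_id, customer_id, amount "
    "FROM public.orders "
    "WHERE order_date >= '2023-01-01'"
)

options = build_jdbc_options(
    url="jdbc:redshift://example-cluster:5439/dev",
    user="etl_user",
    password="REPLACE_ME",  # in practice, load this from a secrets store
    query=FILTER_QUERY,
)
```

In a Glue script you would then call `read_filtered(glueContext.spark_session, options)` and, if the rest of the job expects a DynamicFrame, convert the result with `DynamicFrame.fromDF`. Selecting only the needed columns in the query reduces the transferred volume further.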

A more complex and expensive option is to use EMR to run a Spark job that connects to Redshift, filters the data, and applies the transformations. This requires additional setup and configuration.

AWS
Answered a year ago
  • Thank you for the answer. I'm curious about the second option you suggested. Are you suggesting that I create a custom connection in a script and then execute the query directly from there?