One way to handle extracting large amounts of data, or to test your ETL transformations/extracts, is to create a small sample data set. Redshift supports views, so consider creating a view over the source table with a filter that returns, say, 100k records. Use that view as the ETL source and add the transformations to validate that the ETL works. Then increase the number of records the view returns, or use multiple views over smaller slices of the data, to complete the work. This is one way to handle large data volumes. This link shows how to create a view; it may require elevated privileges, so work with your admin on that: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_VIEW.html
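As a rough sketch of that approach (the schema, table, filter, and cluster identifiers below are all hypothetical), you could create the sample view with the Redshift Data API from Python, or simply run the same CREATE VIEW statement in the Redshift query editor:

```python
import boto3

# Hypothetical names throughout -- replace the cluster, database, user,
# schema/table, and filter with values that match your environment.
client = boto3.client("redshift-data", region_name="us-east-1")

create_view_sql = """
CREATE OR REPLACE VIEW etl_test.sales_sample AS
SELECT *
FROM public.sales
WHERE sale_date >= '2023-01-01'  -- pick a filter that returns roughly 100k rows
"""

response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # assumed provisioned cluster name
    Database="dev",                           # assumed database
    DbUser="awsuser",                         # user needs CREATE VIEW privileges
    Sql=create_view_sql,
)
print(response["Id"])  # statement id; poll describe_statement to confirm it finished
```

Point the Glue job at etl_test.sales_sample instead of the full table, and widen the filter once the transformations are validated.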
Another option is to use a custom JDBC connection to the Redshift database and a SQL query to filter the data before bringing it into Glue. This may be slower, but it gives you more control.
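A minimal sketch of what that could look like inside a Glue PySpark job, assuming the Redshift JDBC driver is available to the job and using placeholder connection details; the query runs on the Redshift side, so only the filtered rows are transferred:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Placeholder endpoint and credentials -- in practice, pull these from
# AWS Secrets Manager or a Glue connection rather than hard-coding them.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

jdbc_url = "jdbc:redshift://my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"

# The filter is applied by Redshift, so only matching rows reach Glue.
filtered_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("user", "awsuser")
    .option("password", "********")
    .option("query", "SELECT * FROM public.sales WHERE sale_date >= '2023-01-01'")
    .load()
)

filtered_df.show(5)  # apply your transformations to filtered_df from here
```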
A more complex and expensive option is to use EMR to run a Spark job that connects to Redshift, filters the data, and applies the transformations. This requires additional setup and configuration.
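If you go that route, the Spark script itself would look much like the JDBC example above; the sketch below only shows submitting it as a step to an existing EMR cluster with boto3, with the cluster id, S3 paths, and script name all placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # assumed id of an existing EMR cluster
    Steps=[
        {
            "Name": "redshift-filter-and-transform",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--jars", "s3://my-bucket/jars/redshift-jdbc42.jar",  # assumed driver jar
                    "s3://my-bucket/scripts/filter_and_transform.py",     # hypothetical script
                ],
            },
        }
    ],
)
print(response["StepIds"])
```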
Thank you for the answer. I'm curious about the second option you suggested. Are you suggesting that I create a custom connection using scripting and then execute the query directly from there?