Thank you for posting the question and the detailed explanation. I understand you would like to know if there is a way to have Redshift Spectrum tables always point to the latest export. The other options you can explore are as follows:
Option 1: After the latest snapshot is exported to S3, you can use an AWS Glue crawler to create external tables. Please find the steps below:
- Automate the process of exporting a snapshot to S3 whenever an automated snapshot is created. For more information on how to set this up, please refer to this GitHub repository.
- Create a crawler in AWS Glue to periodically crawl new data stored in the S3 bucket. The crawler scans the bucket and adds the data in the form of tables; upon completion, it creates or updates tables in the Data Catalog.
- For more information on how to add a Glue crawler, please refer to Adding an AWS Glue crawler.
- Please refer here to understand how a crawler determines when to create partitions.
- Please refer here to understand incremental crawls in AWS Glue.
- Then create the external tables in Amazon Redshift Spectrum using the AWS Glue Data Catalog. Please refer here to understand how to create and query external tables in Redshift Spectrum.
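To illustrate the crawler step above, here is a minimal sketch of the request payload you would pass to the Glue `CreateCrawler` API (for example via boto3's `glue.create_crawler()`). The crawler, role, database, and bucket names are hypothetical placeholders, and the schedule and schema-change policy shown are just one reasonable configuration:

```python
# Sketch: build the configuration for a Glue crawler that periodically
# re-crawls the snapshot-export bucket. All names below are hypothetical.

def build_crawler_config(name, role_arn, database, s3_path):
    """Build the request payload for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Re-crawl on a schedule so newly exported snapshots are picked up.
        "Schedule": "cron(0 6 * * ? *)",  # daily at 06:00 UTC
        # Update existing Data Catalog tables instead of failing on changes.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }

config = build_crawler_config(
    name="snapshot-export-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="snapshot_exports",
    s3_path="s3://my-snapshot-exports/latest/",
)

# With boto3 installed and AWS credentials configured, this would become:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**config)
#   glue.start_crawler(Name=config["Name"])
print(config["Targets"]["S3Targets"][0]["Path"])
```

Because the crawler updates the Data Catalog in place, Spectrum external schemas that reference that catalog database pick up the new tables without any repointing.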
Option 2: When a new snapshot is added to the S3 bucket, you can add it as a partition in your table and then query these partitions. A common practice is to partition the data based on time.
The following procedure describes how to partition your data:
- Store your data in folders in Amazon S3 according to your partition key.
- Create an external table and specify the partition key in the PARTITIONED BY clause.
- Add the partitions using ALTER TABLE … ADD PARTITION.
For more information on partitioning Redshift Spectrum external tables, refer here.
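The three steps above can be sketched as DDL generation. This is a hypothetical example, assuming each snapshot export lands in a date-keyed S3 folder; the schema, table, column, and bucket names are illustrative only:

```python
# Sketch: generate the DDL for a date-partitioned Spectrum external table and
# the ALTER TABLE ... ADD PARTITION statement for a new snapshot export.
# All schema/table/bucket names are hypothetical.

def create_table_ddl(schema, table, s3_root):
    """CREATE EXTERNAL TABLE with the partition key in PARTITIONED BY."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    id BIGINT,\n"
        "    payload VARCHAR(65535)\n"
        ")\n"
        "PARTITIONED BY (snapshot_date DATE)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_root}';"
    )

def add_partition_ddl(schema, table, snapshot_date, s3_root):
    """Register one export folder (e.g. .../snapshot_date=2023-05-01/) as a partition."""
    return (
        f"ALTER TABLE {schema}.{table} "
        f"ADD IF NOT EXISTS PARTITION (snapshot_date = '{snapshot_date}') "
        f"LOCATION '{s3_root}snapshot_date={snapshot_date}/';"
    )

ddl = add_partition_ddl("spectrum", "orders", "2023-05-01",
                        "s3://my-snapshot-exports/orders/")
print(ddl)
```

Running the generated ALTER TABLE statement for each new export keeps the same external table current, and queries can filter on `snapshot_date` to read only the latest export.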
I went with option 1 (altering all the tables to point at the new S3 folder), but I have constantly changing databases: devs are adding tables and columns and changing data types and stuff all the time. My script finds the differences between the S3 export schema file and the DB column config, drops and adds columns and tables as necessary, then repoints the tables. Getting VARBINARY to play nice on long strings was the hardest part. Anywho, if it's MySQL you can try this; I'm hoping to transition there eventually because it costs so much to export snapshots.
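The schema-diff idea in the comment above can be sketched as a comparison of two column listings. This is a minimal, hypothetical illustration (the column names and types are made up), showing only the diff step, not the DDL generation or the repointing:

```python
# Sketch: compare the export's column listing with the current table
# definition to find which columns to add, drop, or retype.
# The example columns and types below are hypothetical.

def diff_columns(export_cols, table_cols):
    """Return (to_add, to_drop, type_changes) between two {name: type} dicts."""
    to_add = {c: t for c, t in export_cols.items() if c not in table_cols}
    to_drop = [c for c in table_cols if c not in export_cols]
    type_changes = {
        c: (table_cols[c], export_cols[c])
        for c in export_cols
        if c in table_cols and table_cols[c] != export_cols[c]
    }
    return to_add, to_drop, type_changes

export = {"id": "bigint", "name": "varchar(256)", "created": "timestamp"}
table = {"id": "bigint", "name": "varchar(64)", "legacy_flag": "boolean"}

to_add, to_drop, changed = diff_columns(export, table)
print(to_add)   # columns present in the export but missing from the table
print(to_drop)  # columns to remove from the table
print(changed)  # columns whose type differs: {name: (old_type, new_type)}
```

Each entry in the result would then drive one ALTER TABLE statement before repointing the table at the new export folder.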