Point Spectrum to latest export of RDS snapshot


Every day, we export a snapshot of an Aurora database to S3 using this feature.
RDS saves each export to S3 under a prefix that includes the export identifier.
How can we have Redshift Spectrum tables always point to the latest export?
I thought of two solutions, but I'm not satisfied with either and was hoping there's a better way I'm missing:

  • After an export completes, modify every external table so it points to the new path
  • Copy the Parquet files to a "latest/" prefix that the external tables always point to
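
For the first option, a minimal sketch of the repointing step could generate an `ALTER TABLE ... SET LOCATION` statement per external table (the schema, table names, bucket, and prefix layout below are hypothetical examples, not the actual export layout):

```python
# Sketch: build ALTER TABLE ... SET LOCATION statements that repoint
# Redshift Spectrum external tables at the newest export prefix.
# Schema/table names and the prefix layout are made-up examples.

def latest_export_prefix(prefixes):
    """Pick the newest export prefix; assumes the export identifier
    embeds a sortable timestamp (e.g. 'export-2023-01-02')."""
    return max(prefixes)

def repoint_statements(schema, tables, bucket, prefix):
    """One ALTER statement per external table."""
    return [
        f"ALTER TABLE {schema}.{table} "
        f"SET LOCATION 's3://{bucket}/{prefix}/{table}/'"
        for table in tables
    ]

prefixes = ["export-2023-01-01", "export-2023-01-02"]
for stmt in repoint_statements(
    "spectrum", ["orders", "users"],
    "my-export-bucket", latest_export_prefix(prefixes),
):
    print(stmt)
```

The generated statements would then be run against Redshift after each export completes.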
2 Answers

Thank you for posting the question and the detailed explanation. I understand you would like to know whether there is a way to have Redshift Spectrum tables always point to the latest export. Other options you can explore are as follows:

Option 1: After the latest snapshot is exported to S3, you can use an AWS Glue crawler to create the external tables. Please find the steps below:

  1. Automate the process of exporting the snapshot to S3 whenever an automated snapshot is created. For more information on how to set this up, please refer to this GitHub repository.

  2. Create a crawler in AWS Glue to periodically crawl new data stored in the S3 bucket. The Glue crawler crawls the bucket and adds the data in the form of tables. Upon completion, the crawler creates or updates tables in the Data Catalog.

  • For more information on how to add a Glue crawler, please refer to Adding an AWS Glue crawler
  • Please refer here to understand how a crawler determines when to create partitions
  • Please refer here to understand incremental crawls in AWS Glue

  3. Then create the external tables in Amazon Redshift Spectrum using the AWS Glue Data Catalog. Please refer here to understand how to create and query external tables in Redshift Spectrum.
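
The export-then-crawl flow in steps 1–3 can be wired together with a Lambda function that starts the crawler when the export finishes. Below is a hedged sketch: the crawler name is a placeholder, and the event-matching logic assumes an EventBridge rule delivering RDS export-task notifications (the exact event shape should be checked against your account's events):

```python
# Sketch: Lambda handler that starts a Glue crawler once an RDS
# snapshot export finishes. CRAWLER_NAME and the event shape are
# assumptions for illustration.

CRAWLER_NAME = "aurora-export-crawler"  # hypothetical crawler name

def export_completed(event):
    """True if the event looks like a finished RDS export task."""
    detail = event.get("detail", {})
    return "Export task completed" in detail.get("Message", "")

def handler(event, context):
    if not export_completed(event):
        return {"started": False}
    # Imported lazily so the sketch can be read/tested without AWS creds.
    import boto3
    glue = boto3.client("glue")
    glue.start_crawler(Name=CRAWLER_NAME)  # real boto3 Glue API call
    return {"started": True}
```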

Option 2: When a new snapshot export is added to the S3 bucket, you can add it as a partition in your table and then query those partitions. A common practice is to partition the data based on time.

The following procedure describes how to partition your data:

  • Store your data in folders in Amazon S3 according to your partition key.
  • Create an external table and specify the partition key in the PARTITIONED BY clause.
  • Add the partitions using ALTER TABLE… ADD PARTITION
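
As an illustration of those three steps, a small helper (bucket, schema, and key layout are hypothetical) can generate the daily `ADD PARTITION` statement, assuming exports land under an `export_date=YYYY-MM-DD/` prefix:

```python
# Sketch: generate an ALTER TABLE ... ADD PARTITION statement for a
# date-partitioned Spectrum table. Names and the S3 layout are
# made-up examples.
import datetime

def add_partition_sql(schema, table, bucket, day):
    """ADD PARTITION for one export day, assuming data is stored
    under s3://<bucket>/export_date=YYYY-MM-DD/."""
    d = day.isoformat()
    return (
        f"ALTER TABLE {schema}.{table} "
        f"ADD IF NOT EXISTS PARTITION (export_date='{d}') "
        f"LOCATION 's3://{bucket}/export_date={d}/'"
    )

print(add_partition_sql("spectrum", "orders", "my-export-bucket",
                        datetime.date(2023, 1, 2)))
```

Running this statement after each export makes the new data queryable without touching the table definition itself.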

For more information on partitioning Redshift Spectrum external tables, refer here.

AWS
SUPPORT ENGINEER
answered 3 years ago

I went with option 1 (alter all the tables to point at the new S3 folder), but I have constantly changing databases: devs are adding tables and columns and changing data types all the time. My script finds the differences between the S3 export's schema file and the database column config, drops and adds columns and tables as necessary, then repoints the tables. Getting varbinary to play nice with long strings was the hardest part. Anywho, if it's MySQL you can try this; I'm hoping to transition there eventually because it costs so much to export snapshots.
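
The diffing step described here could look something like the sketch below (purely illustrative; a real script would read the export's schema JSON and the live table definition, and the column shapes shown are hypothetical):

```python
# Sketch: diff the columns in an export's schema against the current
# external table definition, yielding DROP/ADD column actions.
# Input shapes are hypothetical simplifications.

def diff_columns(export_cols, table_cols):
    """export_cols / table_cols: dicts of column name -> type."""
    actions = []
    for name in table_cols.keys() - export_cols.keys():
        actions.append(f"DROP COLUMN {name}")
    for name in export_cols.keys() - table_cols.keys():
        actions.append(f"ADD COLUMN {name} {export_cols[name]}")
    return sorted(actions)

print(diff_columns(
    {"id": "bigint", "note": "varchar(65535)"},
    {"id": "bigint", "legacy_flag": "boolean"},
))
```

Each action would then be applied with an `ALTER TABLE` against the external table before repointing it at the new export folder.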

answered a year ago
