Can Redshift automatically persist nested data structures straight from S3 buckets?


The setup we currently have is a Kafka cluster with various topics: one topic per data structure, and a default consumer that runs on all of these topics. The consumer simply takes the persisted data and saves it to the appropriate S3 bucket for that data structure. Currently we convert from Protobuf to Parquet files.
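
For context, a minimal sketch of the kind of per-topic consumer described above, assuming confluent-kafka, pyarrow and boto3; the topic name, bucket name and the generated Protobuf class `MyRecord` are all hypothetical placeholders, not part of our actual codebase.

```python
# Sketch: poll a Kafka topic, decode Protobuf, batch into a Parquet file, push to S3.
import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from confluent_kafka import Consumer
from google.protobuf.json_format import MessageToDict

from my_schemas_pb2 import MyRecord  # hypothetical generated Protobuf class

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "s3-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])

s3 = boto3.client("s3")
batch = []

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = MyRecord()
    record.ParseFromString(msg.value())
    batch.append(MessageToDict(record))  # nested messages become nested dicts

    if len(batch) >= 1000:
        table = pa.Table.from_pylist(batch)  # schema inferred, nesting preserved
        buf = io.BytesIO()
        pq.write_table(table, buf)
        s3.put_object(Bucket="my-data-lake",
                      Key=f"my-topic/batch-{msg.offset()}.parquet",
                      Body=buf.getvalue())
        batch.clear()
```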

Currently there is an import routine, but it is bloated and somewhat over-engineered, and it requires an individual mapping file for each data structure. I don't see this as necessary, particularly since Parquet / Protobuf files already contain the schema along with the data. I believe it is entirely possible for Redshift to import all of the data, even if it contains nested properties, and to create the appropriate tables as needed, with no human interaction required once the data exists in the S3 buckets.
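
A quick way to confirm the point above, that the Parquet files already carry the full schema, nested fields included, is to inspect one with pyarrow (the file name here is illustrative):

```python
# Print the schema embedded in a Parquet file produced by the consumer.
import pyarrow.parquet as pq

schema = pq.read_schema("batch-0.parquet")
for field in schema:
    # Nested Protobuf messages show up as struct<...> / list<...> types here.
    print(field.name, field.type)
```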

I recently looked at your articles on Redshift Spectrum and its support for nested data structures, which was interesting as far as nested data goes. However, the steps to create the table are a bit more manual than is ideal. I was looking to persist Parquet files to the S3 bucket and then have those Parquet files handled by the Redshift engine: if a given table doesn't exist, there is sufficient information in the Parquet files to create the required table, so the table is created where necessary. Ideally I would not have to go via Glue to do this; a series of Parquet files in an S3 bucket would be sufficient for both data inserts and table creation.
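
To make the "more manual than ideal" step concrete, this is roughly what the hand-written per-topic DDL looks like today, with the nested fields spelled out explicitly. The sketch issues the statement through the Redshift Data API; the cluster, database, user, external schema and column names are all illustrative, and the external schema is assumed to already exist.

```python
# Illustrative only: the hand-written nested DDL the question is trying to avoid.
import boto3

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder
    Database="analytics",             # placeholder
    DbUser="loader",                  # placeholder
    Sql="""
        CREATE EXTERNAL TABLE kafka_raw.orders (
            order_id   bigint,
            customer   struct<id:bigint, email:varchar(256)>,
            line_items array<struct<sku:varchar(64), qty:int>>
        )
        STORED AS PARQUET
        LOCATION 's3://my-data-lake/orders/';
    """,
)
```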

Kafka is wonderfully scalable, not just in terms of bytes per second but also in the number of data structures. It doesn't really work for us if we have to have a per-topic mechanism that takes care of table creation / data upload (which is roughly what we have today, Redshift-wise). I wish to replace this with a ubiquitous, autonomous mechanism that simply calls COPY, or something similar.
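
A generic "just call COPY" loader along those lines might look like the sketch below, using the Redshift Data API; the cluster, database, role and bucket names are illustrative. Note that COPY ... FORMAT AS PARQUET loads into an existing table only, it does not create the table, which is exactly the gap being discussed here.

```python
# Sketch of a topic-agnostic loader: one COPY per topic prefix, no mapping files.
import boto3

rsd = boto3.client("redshift-data")

def copy_topic(topic: str) -> None:
    sql = f"""
        COPY public.{topic}
        FROM 's3://my-data-lake/{topic}/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="analytics",            # placeholder
        DbUser="loader",                 # placeholder
        Sql=sql,
    )

copy_topic("my_topic")
```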

Suggestions welcome. In my experience with Hadoop / Hive, all any engineer had to do was write data to a topic and optionally submit a schema associated with that topic. As soon as valid data arrived, the engine took care of the rest, including table creation, data merge and upload. I would like exactly the same for S3 / Redshift, as opposed to the laborious process we have today, which ultimately requires teams to submit a data-structure-to-table mapping file that I hope to make superfluous, again based on experience with a previous incarnation of a Kafka / Hadoop / Hive data platform.

2 Answers
Accepted Answer

One thing that might help you is using Glue crawlers to crawl the data and store the schema in the Glue Data Catalog. This gives you two advantages:

  • You can create tables in Redshift referencing external schemas stored in the Glue Data Catalog (CREATE EXTERNAL SCHEMA). This makes life easier and opens the door to automating table creation and data loading.
  • You can be informed by Glue Data Catalog events whenever the crawlers discover new schemas, so you can trigger actions using a combination of EventBridge events and Lambda functions to create these new tables and load data (see the sketch after this list).
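
A rough sketch of how those two pieces could be wired together, using the Redshift Data API. The schema, cluster, database and role names are placeholders, and the exact shape of the Glue Data Catalog event payload (the `changedTables` field below) is an assumption you should verify against the events your EventBridge rule actually receives.

```python
# Sketch: expose the Glue Data Catalog as an external schema once, then let a
# Lambda function (triggered by an EventBridge rule on Glue Data Catalog events)
# materialise each newly crawled table as a local Redshift table.
import boto3

rsd = boto3.client("redshift-data")

REDSHIFT = dict(ClusterIdentifier="my-cluster",  # placeholders throughout
                Database="analytics",
                DbUser="loader")
SPECTRUM_ROLE = "arn:aws:iam::123456789012:role/redshift-spectrum-role"

def register_external_schema() -> None:
    # One-off: every table the crawlers add to the Glue database 'kafka_raw'
    # becomes queryable through this schema with no further DDL.
    rsd.execute_statement(Sql=f"""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS kafka_raw
        FROM DATA CATALOG
        DATABASE 'kafka_raw'
        IAM_ROLE '{SPECTRUM_ROLE}';
    """, **REDSHIFT)

def handler(event, context):
    # Assumption: the event lists newly created tables under detail.changedTables.
    for table in event.get("detail", {}).get("changedTables", []):
        # Works as-is for flat tables; deeply nested columns would instead be
        # queried in place through the external schema or landed into SUPER.
        rsd.execute_statement(Sql=f"""
            CREATE TABLE public.{table} AS
            SELECT * FROM kafka_raw.{table};
        """, **REDSHIFT)
```
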
AWS
answered 10 months ago