Unload data from Redshift and query them using Redshift Spectrum

0

Hi everyone! I'm facing with an hard task regarding AWS Redshift and Spectrum.

This is my case (I'm gonna use fake names explaining my issue): I have a table on Redshift called my_values with following columns:

  1. value_id: the id of value type - number
  2. value_offset: the specific offset of a value type - number
  3. start_time: the timestamp when record was saved on db - timestamp
  4. value: the effective value (could be a string or a number) - string
  5. sender_id: the id of the entity which send data to the db - string

This table contains a lot of data (~800k records/hour) so I want to add something that automatically select data older than a specified date and take them out of Redshift in order to have a quite fixed size of the database. Here the idea of using Spectrum. In my mind, with Spectrum, I'd extract all old record and I'd export them into an s3 bucket. With spectrum I'd create an external table called for example my_old_values and I could query this old data directly from s3.

These was my steps:

1. First attempt: In order to unload data I executed this query:

UNLOAD ('SELECT *, DATE_TRUNC(''day'', start_time) as rounded_start_time FROM my_values') TO 's3://my-bucket/unloaded-data' IAM_ROLE 'arn:aws:iam::my-iam-role' PARTITION BY ( rounded_start_time, sender_id ) FORMAT PARQUET;

In this query I unload all data inside my_values partitioned by 2 columns: rounded_start_date (new column created in the select statement in order to not consider hours minutes and seconds, only partition based on day) and sender_id. What I expected was finding in my bucket a set of folders called rounded_start_time=2024-01-01 and inside them another set of folders called sender_id=1234567. In fact I found them correctly and inside sender_id folders there were all parquet files with tabel data. The only problem was that parquet files were without 2 columns that I use for partition. So all data inside parquet had this structure: {value_id: 1, value_offset: 0, value: 'foe', start_time: '2024-01-01T11:00:00.000Z} for example. So this was not the right way.

2. Second attempt: I re-executed the same unload query, with 1 change. I duplicated sender_id field in order to use one for partition and one to leave inside data (I was ok if rounded_start_time was used only for partition, I needed start_time field). So the new query was the following:

UNLOAD ('SELECT *, DATE_TRUNC(''day'', start_time) as rounded_start_time, sender_id as sender_id_index FROM my_values') TO 's3://my-bucket/unloaded-data' IAM_ROLE 'arn:aws:iam::my-iam-role' PARTITION BY ( rounded_start_time, sender_id_index ) INCLUDE FORMAT PARQUET;

So now I expected same folder structure as before (with sender_id_index=1234567 instead of sender_id) and parquet files with all data I need, also sender_id. In fact I found all them. Then I proceeded with the external table creation.

To do so I had to create an external schema inside my Redshift db with this command:

create external schema my_values_spectrum_schema from data catalog database 'my_values_spectrum' -- I had created a data catalog on AWS Glue before IAM_ROLE 'arn:aws:iam::my-iam-role'

And then I created an external table:

CREATE EXTERNAL TABLE my_values_spectrum_schema.my_old_values( sender_id VARCHAR(65535), value_id INTEGER, value_offset INTEGER, value VARCHAR(65535), start_time TIMESTAMP ) PARTITIONED BY (rounded_start_time TIMESTAMP, sender_id_index VARCHAR(65535)) STORED AS PARQUET LOCATION 's3://my-bucket/unloaded-data'

Creating the table I added same partitions used for unload operation, but when I tried to query data from that table, it returns no results. I think that data was not correctly mapped with s3 bucket cause 2 columns used for partitioning on external table are not present inside parquet data so Spectrum cannot use them.

3. Third attempt: I repeated only the external table process. This time I only added 2 columns used for partitioning as common column, without acting a partition on external table, as this query:

CREATE EXTERNAL TABLE my_values_spectrum_schema.my_old_values( sender_id VARCHAR(65535), value_id INTEGER, value_offset INTEGER, value VARCHAR(65535), start_time TIMESTAMP, rounded_start_time TIMESTAMP, sender_id_index VARCHAR(65535) ) STORED AS PARQUET LOCATION 's3://my-bucket/unloaded-data'

The result was that when I query the table it responded with all results, but columns rounded_start_time and sender_id_index were not populated. This is correct cause they weren't inside parquet data so they cannot be retrieved or mapped by Spectrum.

Questions: So my issues are the following:

  1. How can I map the data from external table to s3 bucket mantaining partitions? so avoiding to scan all s3 bucket when I'm looking for data filtered by partition columns?
  2. Is it enough partitioning data on unload query and not do same thing on external data? if yes, how a select query knows which are partition column and can not scan all bucket?
  3. How can I check (then I found a way to partition correctly the table) if a select query is scanning all bucket or not?
  4. I'm making a huge mistake and I should use other service to do what I want (Glue, other services...)?

Thank you so much for your time reading and answering to my issue and thank you for helping :D

MicZatt
asked 5 months ago398 views
1 Answer
2
Accepted Answer

If you are using Redshift Ra3 instance. When using the RA3 instance type in Amazon Redshift, compute and storage resources can scale independently. The storage management cost in Redshift is nearly the same as the S3 Frequent Access layer. Instead of unloading data to S3 and using Redshift Spectrum (which incurs an additional $5 per TB scanned cost), you can consider an alternative approach:

  1. Create a "History" schema within your Redshift cluster.
  2. Insert data into the History tables.
  3. Create a Union view that combines the data from the Current tables and the Historical data. This approach can be more cost-effective compared to the unload-to-S3-and-use-Spectrum method, as it eliminates the additional costs associated with Redshift Spectrum.

If you are using Amazon Redshift DC2 nodes and would like to have better control over your data storage, you can consider the spectrum approach which you have mentioned earlier:

To automate the process of archiving data from Amazon Redshift to Amazon S3, you can refer to the blog post https://aws.amazon.com/blogs/big-data/automate-data-archival-for-amazon-redshift-time-series-tables/

AWS
EXPERT
Nita_S
answered 5 months ago
profile picture
EXPERT
reviewed 3 months ago
profile picture
EXPERT
reviewed 5 months ago
  • I'm trying to follow the example you explained in this post. I have not clear how to query data at the end in the view. If you include 'WITH NO SCHEMA BINDING' clause creating the view, you haven't connect your external schema to your s3 bucket. In fact when I try to query data from view I receive this error: ERROR: AwsClientException: EntityNotFoundException from glue - Entity Not Found [ErrorId: 1-6627d158-2db4599c65a545474fac2823] have you also created an external table without mention it on the post?

  • Blog post has provided link on how to create external schema. This article has detailed steps,please see if that helps. https://repost.aws/knowledge-center/redshift-spectrum-external-table

  • yeah my point is: have you created it with same partitions of unload? or I have only to create an external table with no partitions and spectrum use s3 partitions?

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions