Hi everyone!
I'm facing with an hard task regarding AWS Redshift and Spectrum.
This is my case (I'm gonna use fake names explaining my issue):
I have a table on Redshift called my_values
with following columns:
- value_id: the id of value type - number
- value_offset: the specific offset of a value type - number
- start_time: the timestamp when record was saved on db - timestamp
- value: the effective value (could be a string or a number) - string
- sender_id: the id of the entity which send data to the db - string
This table contains a lot of data (~800k records/hour) so I want to add something that automatically select data older than a specified date and take them out of Redshift in order to have a quite fixed size of the database. Here the idea of using Spectrum.
In my mind, with Spectrum, I'd extract all old record and I'd export them into an s3 bucket. With spectrum I'd create an external table called for example my_old_values
and I could query this old data directly from s3.
These was my steps:
1. First attempt:
In order to unload data I executed this query:
UNLOAD ('SELECT *, DATE_TRUNC(''day'', start_time) as rounded_start_time FROM my_values') TO 's3://my-bucket/unloaded-data' IAM_ROLE 'arn:aws:iam::my-iam-role' PARTITION BY ( rounded_start_time, sender_id ) FORMAT PARQUET;
In this query I unload all data inside my_values partitioned by 2 columns: rounded_start_date (new column created in the select statement in order to not consider hours minutes and seconds, only partition based on day) and sender_id.
What I expected was finding in my bucket a set of folders called rounded_start_time=2024-01-01
and inside them another set of folders called sender_id=1234567
. In fact I found them correctly and inside sender_id folders there were all parquet files with tabel data.
The only problem was that parquet files were without 2 columns that I use for partition.
So all data inside parquet had this structure: {value_id: 1, value_offset: 0, value: 'foe', start_time: '2024-01-01T11:00:00.000Z} for example.
So this was not the right way.
2. Second attempt:
I re-executed the same unload query, with 1 change. I duplicated sender_id field in order to use one for partition and one to leave inside data (I was ok if rounded_start_time was used only for partition, I needed start_time field). So the new query was the following:
UNLOAD ('SELECT *, DATE_TRUNC(''day'', start_time) as rounded_start_time, sender_id as sender_id_index FROM my_values') TO 's3://my-bucket/unloaded-data' IAM_ROLE 'arn:aws:iam::my-iam-role' PARTITION BY ( rounded_start_time, sender_id_index ) INCLUDE FORMAT PARQUET;
So now I expected same folder structure as before (with sender_id_index=1234567
instead of sender_id
) and parquet files with all data I need, also sender_id.
In fact I found all them. Then I proceeded with the external table creation.
To do so I had to create an external schema inside my Redshift db with this command:
create external schema my_values_spectrum_schema from data catalog database 'my_values_spectrum' -- I had created a data catalog on AWS Glue before IAM_ROLE 'arn:aws:iam::my-iam-role'
And then I created an external table:
CREATE EXTERNAL TABLE my_values_spectrum_schema.my_old_values( sender_id VARCHAR(65535), value_id INTEGER, value_offset INTEGER, value VARCHAR(65535), start_time TIMESTAMP ) PARTITIONED BY (rounded_start_time TIMESTAMP, sender_id_index VARCHAR(65535)) STORED AS PARQUET LOCATION 's3://my-bucket/unloaded-data'
Creating the table I added same partitions used for unload operation, but when I tried to query data from that table, it returns no results.
I think that data was not correctly mapped with s3 bucket cause 2 columns used for partitioning on external table are not present inside parquet data so Spectrum cannot use them.
3. Third attempt:
I repeated only the external table process.
This time I only added 2 columns used for partitioning as common column, without acting a partition on external table, as this query:
CREATE EXTERNAL TABLE my_values_spectrum_schema.my_old_values( sender_id VARCHAR(65535), value_id INTEGER, value_offset INTEGER, value VARCHAR(65535), start_time TIMESTAMP, rounded_start_time TIMESTAMP, sender_id_index VARCHAR(65535) ) STORED AS PARQUET LOCATION 's3://my-bucket/unloaded-data'
The result was that when I query the table it responded with all results, but columns rounded_start_time and sender_id_index were not populated.
This is correct cause they weren't inside parquet data so they cannot be retrieved or mapped by Spectrum.
Questions:
So my issues are the following:
- How can I map the data from external table to s3 bucket mantaining partitions? so avoiding to scan all s3 bucket when I'm looking for data filtered by partition columns?
- Is it enough partitioning data on unload query and not do same thing on external data? if yes, how a select query knows which are partition column and can not scan all bucket?
- How can I check (then I found a way to partition correctly the table) if a select query is scanning all bucket or not?
- I'm making a huge mistake and I should use other service to do what I want (Glue, other services...)?
Thank you so much for your time reading and answering to my issue and thank you for helping :D
I'm trying to follow the example you explained in this post. I have not clear how to query data at the end in the view. If you include 'WITH NO SCHEMA BINDING' clause creating the view, you haven't connect your external schema to your s3 bucket. In fact when I try to query data from view I receive this error: ERROR: AwsClientException: EntityNotFoundException from glue - Entity Not Found [ErrorId: 1-6627d158-2db4599c65a545474fac2823] have you also created an external table without mention it on the post?
Blog post has provided link on how to create external schema. This article has detailed steps,please see if that helps. https://repost.aws/knowledge-center/redshift-spectrum-external-table
yeah my point is: have you created it with same partitions of unload? or I have only to create an external table with no partitions and spectrum use s3 partitions?