I want to create and query an external table with Amazon Redshift Spectrum.
Resolution
To query data from Amazon Simple Storage Service (Amazon S3), you don't need to load data into Amazon Redshift tables. When Redshift Spectrum is processing queries, the data remains in your S3 bucket.
Important: Make sure that your Amazon Redshift cluster and S3 bucket are in the same AWS Region.
To create an external table with Redshift Spectrum, complete the following steps:
-
Create an AWS Identity and Access Management (IAM) role for Amazon Redshift.
-
Attach the following IAM policies to grant Amazon Redshift the required permissions to access your data catalog:
If you use AWS Glue Data Catalog, then attach the AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess IAM policies to your role.
If you use Amazon Athena Data Catalog, then attach the AmazonAthenaFullAccess IAM policy to your role.
-
Associate the IAM role with your Amazon Redshift cluster.
-
Create the external schema.
-
Download the data files and upload them to an S3 bucket in your Region. The following example uses the TICKIT sample data files for S3:
s3://<bucket_name>/tickit/spectrum/event/' and 's3://bucket_name/tickit/spectrum/sales/
Note: Replace bucket name with your S3 bucket's name.
-
Create an external table. The following example creates an external table for your EVENT data:
create external table spectrum.event(
eventid integer,
venueid smallint,
catid smallint,
dateid smallint,
eventname varchar(200),
starttime timestamp
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://bucket_name/tickit/spectrum/event/';
Note: Replace bucket name with your S3 bucket's name. To use AWS Glue to create an external table, be sure to add table definitions to your Data Catalog.
To use Amazon Athena to create an external table, add table definitions.
Example table definitions with Athena:
CREATE EXTERNAL TABLE spectrum.event (
eventid int,
venueid smallint,
catid smallint,
dateid smallint,
eventname varchar(max),
starttime timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
-
To view all external tables that your external schema references, run the following query for SVV_EXTERNAL_TABLES:
select schemaname , tablename , location from svv_external_tables where schemaname = 'spectrum';
schemaname | tablename | location
----------------+---------------------------------+-----------------------------------------------------------------------
spectrum | event | s3://bucket-name/file-location
Note: Replace bucket name with your S3 bucket's name and file location with your file location.
-
Query the external tables as external Redshift Spectrum tables. Use the SELECT statement:
select top 3 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, spectrum.event
where spectrum.sales.eventid = spectrum.event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
eventid | sum
---------+----------
289 | 51846.00
7895 | 51049.00
1602 | 50301.00
The preceding example query joins the external SALES table with an external EVENT table.
Related information
Amazon Redshift Spectrum
Query troubleshooting in Amazon Redshift Spectrum