How do I create and query an external table in Amazon Redshift Spectrum?

3 分的閱讀內容
0

I'm trying to create and query an external table in Amazon Redshift Spectrum. How can I do this?

Resolution

With Amazon Redshift Spectrum, you can query data from Amazon Simple Storage Service (Amazon S3) without having to load data into Amazon Redshift tables. Amazon Redshift Spectrum processes any queries while the data remains in your Amazon S3 bucket.

Important: Before you begin, check whether Amazon Redshift is authorized to access your S3 bucket and any external data catalogs. Additionally, your Amazon Redshift cluster and S3 bucket must be in the same AWS Region.

To create an external table in Amazon Redshift Spectrum, perform the following steps:

1.    Create an AWS Identity and Access Management (IAM) role for Amazon Redshift.

2.    Attach your IAM policy:
If you're using AWS Glue Data Catalog, attach the AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess IAM policies to your role.
If you're using Amazon Athena Data Catalog, attach the AmazonAthenaFullAccess IAM policy to your role.

3.    Associate the IAM role to the Amazon Redshift cluster.

4.    Create the external schema. The external schema references a database in the external data catalog. The external schema also provides the IAM role with an Amazon Resource Name (ARN) that authorizes Amazon Redshift access to S3.

In the following example, we use sample data files from S3 (tickitdb.zip). Unzip and load the individual files to an S3 bucket in your AWS Region like this:

s3://<bucket_name>/tickit/spectrum/event/' and 's3://<bucket_name>/tickit/spectrum/sales/

You can create an external schema using a command similar to the one below:

create external schema spectrum
from data catalog   
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

Note: Replace the ARN of the IAM role with the ARN that you created. Be sure to specify the name of the external database (such as "spectrumdb") for the database parameter.

5.    Create an external table. For example, you can create an external table for your EVENT data like this:

create external table spectrum.event(     eventid integer,
         venueid smallint,
         catid smallint,
         dateid smallint,
         eventname varchar(200),
         starttime timestamp)row format delimitedfields terminated by '|'stored as textfile location 's3://<bucket_name>/tickit/spectrum/event/';

For more information about external tables, see Creating external tables for Redshift Spectrum.

To create an external table using AWS Glue, be sure to add table definitions to your AWS Glue Data Catalog. You can add table definitions in your AWS Glue Data Catalog in several ways. For more information about adding table definitions, see Working with tables on the AWS Glueconsole.

To create an external table using Amazon Athena, add table definitions like this:

CREATE EXTERNAL TABLE `spectrumdb.event`(  `eventid` int, 
      `venueid` smallint, 
      `catid` smallint, 
      `dateid` smallint, 
      `eventname` string, 
      `starttime` timestamp)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '|' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'LOCATION  's3://<bucket_name>/tickit/spectrum/event';

6.    Run the following query for SVV_EXTERNAL_TABLES to view all external tables referenced by your external schema:

select schemaname , tablename , location from svv_external_tables where schemaname = 'spectrum';

schemaname | tablename | location
----------------+---------------------------------+-----------------------------------------------------------------------
spectrum | event | s3://<bucket-name>/<file-location>

7.    Query the external tables (as external Amazon Redshift Spectrum tables) using a SELECT statement:

select top 3 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, spectrum.event
where spectrum.sales.eventid = spectrum.event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;

 eventid |   sum
---------+----------
     289 | 51846.00
    7895 | 51049.00
    1602 | 50301.00

This example query joins the external SALES table with an external EVENT table.


Related information

Querying external data using Amazon Redshift Spectrum

Troubleshooting queries in Amazon Redshift Spectrum

AWS 官方
AWS 官方已更新 1 年前