
Amazon Redshift Serverless cross-region data lake query capability

4 minute read
Content level: Intermediate

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open-format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift Serverless expands these capabilities by enabling you to query external data in Amazon S3 within the same account, across accounts, and across AWS Regions.

Introduction

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open-format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries. Amazon Redshift supports a variety of open data formats, including Apache Iceberg, Parquet, CSV, and JSON. In Redshift, you can create references to externally stored data by providing a definition of its structure and the location of the data in Amazon S3. You can also read data from tables registered in the AWS Glue Data Catalog.
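As an illustration, here is a minimal sketch of such an external reference. The schema name spectrum_schema, the table and column names, and the s3://example-bucket path are hypothetical, and the external schema is assumed to already exist:

-- Hypothetical example: reference Parquet files on Amazon S3 as an external table.
-- Assumes the external schema spectrum_schema has already been created.
CREATE EXTERNAL TABLE spectrum_schema.sales_events (
    event_id   integer,
    event_time timestamp,
    amount     double precision
)
STORED AS PARQUET
LOCATION 's3://example-bucket/sales-events/';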

Amazon Redshift Serverless queries of external data in Amazon S3 are not billed separately; they are included in the Amazon Redshift Serverless charges, which are billed in RPU-hours. Review Amazon Redshift Serverless pricing for details. For cross-region data lake queries, data transfer costs apply, and data transfer pricing varies by Region.

Amazon Redshift Provisioned enables you to query external data in Amazon S3 within the same account and across accounts in the same Region. Redshift Provisioned also allows you to load Parquet files stored on Amazon S3 in the same Region into local tables.

Amazon Redshift Serverless expands these capabilities by enabling you to query external data in Amazon S3 within the same account, across accounts, and across AWS Regions. This is possible because an Amazon Redshift Serverless endpoint in one Region can query data that's registered in an AWS Glue Data Catalog in a different AWS Region. In this post, we will walk through use cases in which you can use Amazon Redshift Serverless to perform cross-region data lake analytical tasks.

Setup

Architecture

Oregon (us-west-2):

  • Redshift Serverless endpoint

N. Virginia (us-east-1):

  • AWS Glue database: "inputdata"
  • Table: "ny_pub"
  • Storage: Amazon S3, "ny_pub" data in Parquet format

IAM Role Requirements:

  1. The IAM role must be attached to the Redshift Serverless namespace.
  2. Required permissions:
     • Read access to the S3 bucket in us-east-1
     • Read access to the AWS Glue Data Catalog in us-east-1

Note: Disable enhanced VPC routing for the Redshift Serverless endpoint; otherwise, cross-region data lake queries will fail with a timeout error.

Error message: “ERROR: AwsClientException: - Failed to perform AWS request, curlError=Connection timeout after 30000 m”

Steps:

  1. Create an external schema on the Redshift Serverless endpoint in us-west-2. Replace IAM_ROLE DEFAULT with your IAM role ARN if you are not using the default role.
CREATE EXTERNAL SCHEMA crossregion_datalake_catalog
FROM DATA CATALOG
DATABASE 'inputdata'
IAM_ROLE DEFAULT
REGION 'us-east-1';
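
Optionally, you can confirm that the external schema was created by querying the SVV_EXTERNAL_SCHEMAS system view (a quick sanity check, not part of the original walkthrough):

-- Verify the external schema registration
SELECT schemaname, databasename
FROM svv_external_schemas
WHERE schemaname = 'crossregion_datalake_catalog';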
  2. Use case: analyze cross-region data lake data. Run the following query to review the cross-region data lake table "ny_pub".
SELECT TO_CHAR(pickup_datetime, 'YYYY-MM-DD'), COUNT(*)
FROM crossregion_datalake_catalog.ny_pub
WHERE year = 2016 AND month = 1
GROUP BY 1
ORDER BY 2;

Analyze data lake table ny_pub
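
Because "ny_pub" is partitioned by year and month, you can optionally inspect the partitions registered in the Glue Data Catalog through the SVV_EXTERNAL_PARTITIONS system view before filtering on them (an optional check, not part of the original walkthrough):

-- List the registered partitions of the external table
SELECT schemaname, tablename, "values", location
FROM svv_external_partitions
WHERE tablename = 'ny_pub';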

  3. Use case: load a table using cross-region data lake data. This example creates the local_ny_pub table by selecting from the "ny_pub" table.
CREATE TABLE local_ny_pub AS 
SELECT * FROM crossregion_datalake_catalog.ny_pub WHERE year = 2009;
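
As a quick sanity check, you can compare the row count of the new local table against the same filter on the external table (optional, not part of the original walkthrough):

-- Row count of the newly created local table
SELECT COUNT(*) FROM local_ny_pub;

-- Should match the count for the same filter on the external table
SELECT COUNT(*) FROM crossregion_datalake_catalog.ny_pub WHERE year = 2009;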
  4. Use case: copy cross-region Apache Parquet columnar data.
CREATE TABLE public.local_ny_pub_copy (
    vendorid character varying(15),
    pickup_datetime timestamp without time zone,
    dropoff_datetime timestamp without time zone,
    ratecode integer,
    passenger_count integer,
    trip_distance double precision,
    fare_amount double precision,
    total_amount double precision,
    payment_type integer 
); 

COPY local_ny_pub_copy  FROM 's3://redshift-demos/data/NY-Pub/year=2010/month=10/type=yellow/'
FORMAT PARQUET
REGION 'us-east-1'
IAM_ROLE DEFAULT;

Cross-region COPY using FORMAT PARQUET
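
To verify the load in Redshift Serverless, you can query the SYS_LOAD_HISTORY system view; the column names below follow the current documentation and are worth confirming in your environment:

-- Review the most recent COPY operations
SELECT table_name, loaded_rows, loaded_bytes, status
FROM sys_load_history
ORDER BY start_time DESC
LIMIT 5;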

Conclusion

Using Amazon Redshift Serverless, you can perform cross-region data lake analytical use cases. Amazon Redshift Serverless doesn't require that your data in Amazon S3 be in the same AWS Region as your Redshift endpoint. This solution shows how you can use Redshift to query data in S3 located in a different Region through the AWS Glue Data Catalog. Consider testing this approach with your data to validate whether it meets your workload needs.

Thanks to Milind Oke for review.
