
Is it possible to do an AWS Redshift Cross-Cluster Federated Query?


I have two different Redshift clusters. Each cluster has a PostgreSQL database with the same schema.

I want to do a federated query across these 2 databases in the 2 different Redshift clusters.

Is it possible? Please guide.

asked 2 years ago · 967 views
2 Answers

Performing federated queries directly across two different Amazon Redshift clusters is not supported out of the box as of now, but there are workarounds that achieve similar functionality.

You can try one of the approaches below.

  1. The first option is to use Amazon Redshift Spectrum and Amazon S3. Export data from one Redshift cluster to Amazon S3, then query it from the other cluster using Redshift Spectrum.
  • Export Data to S3 from Cluster 1:

  • Use the 'UNLOAD' command to export the required data from the first Redshift cluster to an S3 bucket.

    UNLOAD ('SELECT * FROM your_table')
    TO 's3://your-bucket/your-prefix/your-table-'
    IAM_ROLE 'your-iam-role-arn'
    DELIMITER ','  -- UNLOAD defaults to pipe-delimited; this must match the external table definition
    PARALLEL OFF;
    
  • In the second Redshift cluster, create an external schema that points to the S3 bucket.

    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'your_glue_database'
    IAM_ROLE 'your-iam-role-arn'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    
  • Create external tables in the second cluster that reference the data in S3.

    CREATE EXTERNAL TABLE spectrum_schema.your_table (
            column1 data_type,
            column2 data_type,
            ...
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://your-bucket/your-prefix/';
    
  • You can now query the data from both local tables and the external tables.

    SELECT a.*, b.*
    FROM local_table a
    JOIN spectrum_schema.your_table b
    ON a.id = b.id;
    
  2. Another option is to use an AWS Lambda function to pull data from one cluster and insert it into the other.
  • Create a Lambda function that connects to the source Redshift cluster, executes the query, and inserts the results into the target Redshift cluster.

Here is an example of what the Lambda function might look like using Python and the psycopg2 library (note that psycopg2 is not in the Lambda runtime, so it must be packaged with the function or supplied as a layer).

  import psycopg2

  def lambda_handler(event, context):
      # Credentials are hard-coded here for brevity; in practice, read them
      # from environment variables or AWS Secrets Manager.
      src_conn = psycopg2.connect(
          dbname='source_db',
          user='username',
          password='password',
          host='source_cluster_endpoint',
          port='5439'
      )
      tgt_conn = psycopg2.connect(
          dbname='target_db',
          user='username',
          password='password',
          host='target_cluster_endpoint',
          port='5439'
      )

      try:
          src_cursor = src_conn.cursor()
          tgt_cursor = tgt_conn.cursor()

          src_cursor.execute('SELECT * FROM your_table')
          rows = src_cursor.fetchall()

          # executemany batches the inserts instead of one round trip per row
          tgt_cursor.executemany(
              'INSERT INTO your_table (column1, column2, ...) VALUES (%s, %s, ...)',
              rows
          )
          tgt_conn.commit()
      finally:
          src_conn.close()
          tgt_conn.close()

      return {'status': 'success'}

3. You could also consider using an ETL (Extract, Transform, Load) tool such as AWS Glue, which can automate extracting data from one Redshift cluster and loading it into another.

  • Create a Glue job that extracts data from the source Redshift cluster, transforms it if needed, and loads it into the target Redshift cluster.

Configuration steps:

  • Configure Glue connections for both Redshift clusters.

  • Create a Glue job with a script that handles the data transfer.

  • Schedule the job or trigger it as needed.

Choose whichever method best fits your use case and requirements.

For more detail on these topics, refer to the following:

Amazon Redshift Spectrum Documentation

Creating External Tables For Redshift Spectrum

Querying External Data Using Amazon Redshift Spectrum

AWS
answered 2 years ago

Assuming you have local data on two different Amazon Redshift clusters, you can securely access this data across both clusters using Amazon Redshift data sharing.

This feature allows you to share access to live data between Redshift clusters, even if they are located in different AWS accounts or regions.

Please refer to the following documentation for detailed instructions on how to set up and manage data sharing.

https://docs.aws.amazon.com/redshift/latest/dg/datashare-overview.html
https://aws.amazon.com/blogs/big-data/sharing-amazon-redshift-data-securely-across-amazon-redshift-clusters-for-workload-isolation/
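As a minimal sketch of the setup (the share name, table name, and namespace GUIDs below are placeholders): create and populate a datashare on the producer cluster, grant it to the consumer cluster's namespace, then create a database from the share on the consumer.

```sql
-- On the producer cluster: create the share and add objects to it
CREATE DATASHARE my_share;
ALTER DATASHARE my_share ADD SCHEMA public;
ALTER DATASHARE my_share ADD TABLE public.your_table;

-- Grant the share to the consumer cluster's namespace GUID
GRANT USAGE ON DATASHARE my_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster: create a local database from the share
CREATE DATABASE shared_db FROM DATASHARE my_share OF NAMESPACE '<producer-namespace-guid>';

-- Query the shared (live) data using three-part notation
SELECT * FROM shared_db.public.your_table;
```

Because the share references live data, no copy or refresh step is needed on the consumer side.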

If you have external tables set up using Redshift federated queries against PostgreSQL, you can copy that data into local tables on your Redshift cluster and then share those tables through the data sharing feature.
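As a rough sketch of that local copy (the schema and table names are placeholders, and this assumes a federated external schema has already been created with CREATE EXTERNAL SCHEMA ... FROM POSTGRES):

```sql
-- Materialize the federated PostgreSQL table as a local Redshift table
-- so it can then be added to a datashare
CREATE TABLE public.your_table_local AS
SELECT * FROM federated_schema.your_table;
```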

AWS
EXPERT
answered 2 years ago
