Ingesting full load and CDC data from Delta Lake to Redshift using AWS Glue native features

5 minute read
Content level: Advanced

This post discusses how customers can ingest data from Delta Lake into Amazon Redshift using AWS Glue, in both full load and change data capture (CDC) formats.

Introduction

Amazon Redshift is a fully managed data warehousing service from Amazon Web Services (AWS) that tens of thousands of customers use every day to run SQL analytics in the cloud, processing exabytes of data for business insights. Whether your growing data is stored in operational data stores, data lakes, streaming data services, or third-party datasets, Amazon Redshift helps you securely access, combine, and share data with minimal movement or copying.

Amazon Redshift Spectrum, a feature of Amazon Redshift, enables you to query your S3 data lake directly from your Redshift data warehouse without first loading the data into it, minimizing time to insight. However, customers often want to ingest data from their data lake into Amazon Redshift to meet strict SLA requirements for analytical workloads, save cost, or use native Redshift features such as data sharing or merging data with operational data using zero-ETL.

Architecture pattern

(Architecture diagram: AWS Glue crawlers and ETL jobs move full-load and incremental data from Delta Lake tables on Amazon S3 into Amazon Redshift.)

Prerequisites

Before implementing the steps in this solution walkthrough, ensure you have the following prerequisites:

To copy the sample data into your own S3 bucket, run:

aws s3 sync s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table/ s3://<your_s3_bucket>/data/sample_delta_table

Solution Overview

The solution workflow consists of the following steps:

  1. Create a Glue IAM service role
  2. Configure an AWS Glue Crawler to create a lineitem source table
  3. Create an AWS Glue connection to the Amazon Redshift cluster
  4. Configure an AWS Glue Crawler to create a deltalake_lineitem target table
  5. Configure one-time full data load from Delta Lake to Amazon Redshift
  6. Configure Incremental data load from Delta Lake to Amazon Redshift

Implementation Steps

1. Create a Glue IAM Service Role

  1. Navigate to Roles in AWS IAM console
  2. Choose "Create role"
  3. Select "AWS service" as trusted entity type and choose "AWS Glue"
  4. Add permissions:
    • AWSGlueServiceRole
    • AmazonS3FullAccess


  5. Name the role (e.g., "AWSGlueServiceRoleDefault")
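The same role can be created programmatically. The sketch below, using boto3, mirrors the console steps under stated assumptions: the role name matches the example above, and the two managed policies are the ones listed in step 4.

```python
# Sketch: creating the Glue service role with boto3 instead of the console.
# The role name is the example from this walkthrough; adjust as needed.
import json


def glue_trust_policy():
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def create_glue_role(iam, role_name="AWSGlueServiceRoleDefault"):
    """Create the role and attach the two managed policies from step 4."""
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    for arn in (
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
        "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    ):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role


# Example (requires AWS credentials):
# import boto3
# create_glue_role(boto3.client("iam"))
```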

2. Create Source S3 Bucket Crawler

  1. In AWS Glue Console, go to Crawlers
  2. Choose "Create crawler"
  3. Configure crawler:
    • Name: e.g., "delta-lake-symlink"
    • Data source: Choose S3
    • Location: Select "In this account"
    • Specify S3 path containing TPCH data
    • Select the IAM role created earlier

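For repeatable setups, the source crawler can also be defined with boto3. The helper below builds the `create_crawler` request from the console choices above; the database name `deltalake-db` is taken from the incremental-load step later in this post, and the S3 path is a placeholder for your own bucket.

```python
# Sketch: the S3 source crawler expressed as a boto3 create_crawler request.
def source_crawler_config(role_arn, s3_path, name="delta-lake-symlink",
                          database="deltalake-db"):
    """Build the create_crawler request for the S3 source crawler."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# Example (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**source_crawler_config(
#     "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault",
#     "s3://<your_s3_bucket>/data/sample_delta_table/"))
# glue.start_crawler(Name="delta-lake-symlink")
```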

3. Create Amazon Redshift Connection

  1. Navigate to Connections in Data Catalog
  2. Choose "Create Connection"
  3. Select Amazon Redshift as data source
  4. Configure connection details:
    • Select target Redshift cluster
    • Specify database name and credentials
    • Name the connection (e.g., "Redshift Connection")
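The connection can likewise be scripted. The sketch below builds a JDBC-style `create_connection` request matching the console fields; note that the console flow also captures VPC, subnet, and security group details (`PhysicalConnectionRequirements`), which are omitted here for brevity. The URL, username, and password values are placeholders.

```python
# Sketch: the Redshift connection as a boto3 create_connection request.
# Network settings (VPC/subnet/security group) are intentionally omitted.
def redshift_connection_config(cluster_jdbc_url, username, password,
                               name="Redshift Connection"):
    """Build the create_connection request for a JDBC Redshift connection."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": cluster_jdbc_url,
                "USERNAME": username,
                "PASSWORD": password,
            },
        }
    }


# Example (requires AWS credentials):
# import boto3
# boto3.client("glue").create_connection(**redshift_connection_config(
#     "jdbc:redshift://<cluster-endpoint>:5439/dev", "awsuser", "<password>"))
```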

4. Create Target Redshift Crawler

Note: The Amazon Redshift connection from step 3 must exist before you create this crawler. After the connection has been created successfully, follow these steps:

  1. Create new crawler in AWS Glue Console
  2. Configure:
    • Name: e.g., "delta-lake-redshift-crawler"
    • Data Source: JDBC
    • Select Redshift Connection
    • Provide database, schema, and table names
    • Use existing IAM role
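As with the source crawler, this JDBC crawler can be expressed as a `create_crawler` request. In the sketch below, the `path` value follows Glue's database/schema/table convention for JDBC targets; the table name `deltalake_lineitem` is an assumption based on the full-load target table used later.

```python
# Sketch: the JDBC target crawler as a boto3 create_crawler request.
def target_crawler_config(role_arn, connection_name="Redshift Connection",
                          name="delta-lake-redshift-crawler",
                          database="deltalake-db",
                          path="dev/public/deltalake_lineitem"):
    """Build the create_crawler request for the Redshift target crawler.

    `path` is database/schema/table inside the Redshift cluster.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"JdbcTargets": [
            {"ConnectionName": connection_name, "Path": path},
        ]},
    }


# Example (requires AWS credentials):
# import boto3
# boto3.client("glue").create_crawler(**target_crawler_config(
#     "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"))
```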

5. Create Full Load AWS Glue ETL Job

  1. Navigate to ETL jobs in Glue Studio
  2. Choose Visual ETL
  3. Configure job:
    • Name: e.g., "blog-deltalake-to-redshift-fullload"
    • Source: Amazon S3
    • Target: Amazon Redshift
    • Configure S3 node with Database and Table
    • Configure Redshift target:
      • Access type: "Direct data connection"
      • Schema: "public"
      • Table: "deltalake_lineitem"
      • Handling: "Append (insert) to target table"

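Glue Studio generates a PySpark script from these visual choices. The sketch below reduces it to the Redshift write options; the awsglue boilerplate (GlueContext, Job initialization) is left as comments because it only runs inside a Glue job, and the temp directory and catalog table name are placeholders/assumptions.

```python
# Sketch: the append-style write options behind the full-load job.
def full_load_options(schema="public", table="deltalake_lineitem"):
    """Target options for an append (insert) write to Redshift."""
    return {
        "dbtable": f"{schema}.{table}",
        # Staging location Glue uses for the COPY into Redshift (placeholder):
        "redshiftTmpDir": "s3://<your_s3_bucket>/temp/",
        # Reuse the catalog connection created earlier:
        "useConnectionProperties": "true",
        "connectionName": "Redshift Connection",
    }


# Inside the generated Glue job, the options are used roughly like this:
# source = glueContext.create_dynamic_frame.from_catalog(
#     database="deltalake-db", table_name="<source_table>")
# glueContext.write_dynamic_frame.from_options(
#     frame=source, connection_type="redshift",
#     connection_options=full_load_options())
```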

6. Incremental Data Load Setup

  1. Create source crawler
  2. Create target crawler with JDBC connection
  3. Create Glue job for incremental loads:
    • Source: Amazon S3
    • Target: Amazon Redshift
    • Access type: "Glue Data Catalog tables"
    • Database: "deltalake-db"
    • Table: "dev/public/deltalake_stage_lineitem"
    • Handling: "Merge data into target table"

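When "Merge data into target table" is selected, Glue loads the incremental rows into the staging table and then merges them into the target as a post-action. The sketch below shows an illustrative equivalent of that merge as Redshift SQL; the join key `l_orderkey` is an assumption based on the TPC-H lineitem schema (the real lineitem key also includes `l_linenumber`), so substitute your table's actual key columns.

```python
# Sketch: an illustrative Redshift MERGE post-action from the staging table
# into the target table. Key column is an assumption; adjust to your schema.
def merge_postaction(target="public.deltalake_lineitem",
                     stage="public.deltalake_stage_lineitem",
                     key="l_orderkey"):
    """Build a simplified Redshift MERGE statement (REMOVE DUPLICATES form)."""
    return (
        f"MERGE INTO {target} USING {stage} "
        f"ON {target}.{key} = {stage}.{key} "
        f"REMOVE DUPLICATES;"
    )
```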

Validation

  • Use Redshift Query Editor V2 to verify data loads
  • Run count queries after full and incremental loads
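The validation counts can also be run programmatically with the Redshift Data API instead of Query Editor V2. In the sketch below, the cluster identifier, database, and user are placeholders.

```python
# Sketch: validating row counts via the Redshift Data API.
def count_query(table="public.deltalake_lineitem"):
    """SQL to count rows in the target table after a load."""
    return f"SELECT COUNT(*) FROM {table};"


# Example (requires AWS credentials):
# import boto3
# rsd = boto3.client("redshift-data")
# resp = rsd.execute_statement(
#     ClusterIdentifier="<your-cluster>", Database="dev",
#     DbUser="awsuser", Sql=count_query())
# # Poll describe_statement(Id=resp["Id"]) until finished, then:
# # rsd.get_statement_result(Id=resp["Id"])
```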

Orchestration

Configure Amazon EventBridge and AWS Glue workflows for automated incremental processing when new files arrive in S3. Follow this blog for more details.
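As one possible starting point, the sketch below builds an EventBridge event pattern that matches new objects in the source bucket; it assumes EventBridge notifications are enabled on the bucket, and the rule name and bucket are placeholders. Attaching the Glue workflow as the rule's target is covered in the linked blog.

```python
# Sketch: an EventBridge pattern matching S3 "Object Created" events for the
# source bucket. Assumes the bucket has EventBridge notifications enabled.
def s3_object_created_pattern(bucket):
    """Event pattern for new objects landing in the given bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }


# Example (requires AWS credentials):
# import boto3, json
# boto3.client("events").put_rule(
#     Name="delta-lake-new-files",
#     EventPattern=json.dumps(s3_object_created_pattern("<your_s3_bucket>")))
# # Then attach the Glue workflow as the rule target per the linked blog.
```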

Best Practices

  • Refer to Redshift documentation for table design optimization
  • Implement CDC process to capture and replicate changed data
  • Use AWS Glue built-in operators for efficient incremental changes

Clean up

To avoid charges, delete:

  • S3 bucket
  • AWS Glue crawler
  • AWS Glue database
  • AWS Glue table
  • Amazon Redshift cluster

Conclusion

AWS Glue crawlers can discover source datasets, extract schema information, and populate the AWS Glue Data Catalog from Delta Lake tables. AWS Glue jobs can use native merge functionality to merge incremental data into Amazon Redshift, reducing the time and cost of incremental processing.
