This post shows how customers can ingest data from Delta Lake into Amazon Redshift using AWS Glue, covering both full load and change data capture (CDC) scenarios.
Introduction
Amazon Redshift is a fully managed data warehousing service from Amazon Web Services (AWS) that tens of thousands of customers use every day to run SQL analytics in the cloud, processing exabytes of data for business insights. Whether your growing data is stored in operational data stores, data lakes, streaming data services, or third-party datasets, Amazon Redshift helps you securely access, combine, and share data with minimal movement or copying.
Amazon Redshift Spectrum, a feature of Amazon Redshift, enables you to query data in your Amazon S3 data lake directly from your Redshift data warehouse without first loading it, minimizing time to insight. However, customers often want to ingest data from their data lake into Amazon Redshift to meet strict SLA requirements for their analytical workloads, save cost, or use native Redshift features such as data sharing, or merge data with operational data using zero-ETL.
Architecture pattern

In this pattern, AWS Glue crawlers catalog both the Delta Lake source table and the Amazon Redshift target table in the AWS Glue Data Catalog, and AWS Glue ETL jobs move the data from the S3 data lake into Amazon Redshift, first as a one-time full load and then as incremental merges.
Prerequisites
Before implementing the steps in this solution walkthrough, ensure you have the following prerequisites:
To copy the sample data, run the following command:

```
aws s3 sync s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table/ s3://<your_s3_bucket>/data/sample_delta_table
```
Solution Overview
The solution workflow consists of the following steps:
- Create an AWS Glue IAM service role
- Configure an AWS Glue crawler to create a lineitem source table
- Create an AWS Glue connection to the Amazon Redshift cluster
- Configure an AWS Glue crawler to create a deltalake_lineitem target table
- Configure a one-time full data load from Delta Lake to Amazon Redshift
- Configure an incremental data load from Delta Lake to Amazon Redshift
Implementation Steps
1. Create a Glue IAM Service Role
- Navigate to Roles in the AWS IAM console
- Choose "Create role"
- Select "AWS service" as the trusted entity type and choose "AWS Glue"
- Add permissions:
  - AWSGlueServiceRole
  - AmazonS3FullAccess (for production workloads, scope this down to the required buckets)

- Name the role (e.g., "AWSGlueServiceRoleDefault")
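If you prefer to automate this step, the following boto3 sketch creates an equivalent role. The role name matches the example above; as noted, the broad AmazonS3FullAccess policy should be scoped down for production use.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets AWS Glue assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRoleDefault",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies used in this walkthrough
for arn in (
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="AWSGlueServiceRoleDefault", PolicyArn=arn)
```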
2. Create Source S3 Bucket Crawler
- In the AWS Glue console, go to Crawlers
- Choose "Create crawler"
- Configure crawler:
  - Name: e.g., "delta-lake-symlink"
  - Data source: choose S3
  - Location: select "In this account"
  - S3 path: specify the path containing the TPC-H sample data copied earlier
- Select the IAM role created earlier
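The console steps above can also be scripted. The boto3 sketch below assumes the crawler uses the AWS Glue Delta Lake data source (DeltaTargets) with a symlink manifest, consistent with the crawler name, and that the Data Catalog database is deltalake-db as used later in this post; the S3 path is the sample data location from the prerequisites.

```python
import boto3

glue = boto3.client("glue")

# Crawl the Delta Lake table and write a symlink manifest so that
# other engines can read it; the bucket name is a placeholder.
glue.create_crawler(
    Name="delta-lake-symlink",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="deltalake-db",
    Targets={
        "DeltaTargets": [{
            "DeltaTables": ["s3://<your_s3_bucket>/data/sample_delta_table/"],
            "WriteManifest": True,
        }]
    },
)
glue.start_crawler(Name="delta-lake-symlink")
```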

3. Create Amazon Redshift Connection
- Navigate to Connections under the Data Catalog
- Choose "Create connection"
- Select Amazon Redshift as the data source
- Configure connection details:
  - Select the target Redshift cluster
  - Specify the database name and credentials
- Name the connection (e.g., "Redshift Connection")
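Equivalently, the connection can be created with boto3. The endpoint, credentials, and VPC settings below are placeholders for your environment.

```python
import boto3

glue = boto3.client("glue")

# JDBC connection to the target Redshift cluster
glue.create_connection(
    ConnectionInput={
        "Name": "Redshift Connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://<cluster-endpoint>:5439/dev",
            "USERNAME": "<username>",
            "PASSWORD": "<password>",
        },
        # Network placement so Glue can reach the cluster
        "PhysicalConnectionRequirements": {
            "SubnetId": "<subnet-id>",
            "SecurityGroupIdList": ["<security-group-id>"],
            "AvailabilityZone": "<availability-zone>",
        },
    }
)
```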
4. Create Target Redshift Crawler
Note: The Amazon Redshift connection from Step 3 must exist before you create this crawler.
- Create a new crawler in the AWS Glue console
- Configure:
  - Name: e.g., "delta-lake-redshift-crawler"
  - Data source: JDBC
  - Connection: select the Redshift connection created earlier
  - Include path: provide the database, schema, and table names (for example, dev/public/deltalake_lineitem)
- Use the existing IAM role
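A scripted equivalent of the target crawler might look like the following; the include path dev/public/deltalake_lineitem is an assumption based on the database, schema, and table names used in this post.

```python
import boto3

glue = boto3.client("glue")

# Crawl the target Redshift table over the JDBC connection created
# earlier; the path follows <database>/<schema>/<table>.
glue.create_crawler(
    Name="delta-lake-redshift-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="deltalake-db",
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "Redshift Connection",
            "Path": "dev/public/deltalake_lineitem",
        }]
    },
)
glue.start_crawler(Name="delta-lake-redshift-crawler")
```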
5. Create Full Load AWS Glue ETL Job
- Navigate to ETL jobs in Glue Studio
- Choose Visual ETL
- Configure job:
  - Name: e.g., "blog-deltalake-to-redshift-fullload"
  - Source: Amazon S3
  - Target: Amazon Redshift
- Configure the S3 node with the Data Catalog database and table created by the source crawler
- Configure the Redshift target:
  - Access type: "Direct data connection"
  - Schema: "public"
  - Table: "deltalake_lineitem"
  - Handling: "Append (insert) to target table"

6. Incremental Data Load Setup
- Create a source crawler for the incremental Delta Lake data
- Create a target crawler using the JDBC connection
- Create a Glue job for the incremental loads:
  - Source: Amazon S3
  - Target: Amazon Redshift
  - Access type: "Glue Data Catalog tables"
  - Database: "deltalake-db"
  - Table: "dev/public/deltalake_stage_lineitem"
  - Handling: "Merge data into target table"

Validation
- Use the Amazon Redshift Query Editor V2 to verify the data loads
- Run count queries after the full and incremental loads to confirm the row counts match expectations
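If you want to script the check instead of using the query editor, the Redshift Data API can run the same count query; the cluster identifier, database, and database user below are placeholders.

```python
import time

import boto3

rsd = boto3.client("redshift-data")

# Row-count check after each load
resp = rsd.execute_statement(
    ClusterIdentifier="<cluster-id>",
    Database="dev",
    DbUser="<db-user>",
    Sql="SELECT COUNT(*) FROM public.deltalake_lineitem;",
)

# Poll until the statement completes
while True:
    status = rsd.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

result = rsd.get_statement_result(Id=resp["Id"])
print("row count:", result["Records"][0][0]["longValue"])
```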
Orchestration
Configure Amazon EventBridge and AWS Glue workflows for automated incremental processing when new files arrive in S3. Follow this blog for more details.
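As a sketch of the Glue side of that setup: an EVENT-type trigger inside a workflow starts the incremental job when the associated EventBridge rule matches (the rule itself, firing on s3:ObjectCreated events and targeting the workflow, is configured separately in EventBridge). The workflow and job names here are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Workflow that the EventBridge rule targets
glue.create_workflow(Name="deltalake-incremental-workflow")

# EVENT-type trigger: starts the incremental load job when the
# workflow receives a matching event (job name is an assumption)
glue.create_trigger(
    Name="deltalake-incremental-trigger",
    WorkflowName="deltalake-incremental-workflow",
    Type="EVENT",
    Actions=[{"JobName": "blog-deltalake-to-redshift-incremental"}],
)
```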
Best Practices
- Refer to the Amazon Redshift documentation for table design optimization, such as distribution and sort keys
- Implement a CDC process to capture and replicate changed data
- Use AWS Glue built-in operators, such as merge, for efficient incremental processing
Clean up
To avoid ongoing charges, delete the following resources:
- S3 bucket
- AWS Glue crawlers
- AWS Glue database
- AWS Glue tables
- Amazon Redshift cluster
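The Glue-side cleanup can be scripted as follows (resource names match the examples used earlier); the S3 bucket and the Redshift cluster are deleted separately through their own consoles.

```python
import boto3

glue = boto3.client("glue")

# Delete the crawlers created in this walkthrough
for crawler in ("delta-lake-symlink", "delta-lake-redshift-crawler"):
    glue.delete_crawler(Name=crawler)

# Deleting the database also removes the tables it contains
glue.delete_database(Name="deltalake-db")
```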
Conclusion
In this post, we showed how AWS Glue crawlers can discover source datasets, extract schema information, and populate the AWS Glue Data Catalog from Delta Lake tables. An AWS Glue job can then use native merge functionality to merge incremental data into Amazon Redshift, reducing the time and cost of incremental processing.