Automated change data capture (CDC) data ingestion from DynamoDB to Redshift
CDC changes from DynamoDB table to Redshift table
Introduction
The solution enables change data capture (CDC) replication between DynamoDB and Redshift. This process makes it easier to replicate multiple tables without the need to code replication for each table. Schema changes are detected and handled in the Redshift process to keep it in sync with DynamoDB. It also provides the ability to schedule the CDC workload for multiple tables, allowing better cluster resource management.
At a high level, the process involves the following activities:
- Create a Kinesis Data Stream (KDS) and turn on streaming to the KDS for the DynamoDB tables to be replicated.
- Create a materialized view in your Amazon Redshift cluster to consume streaming data from the Kinesis Data Stream.
- Create and execute stored procedures in Redshift. The streaming data is ingested as a JSON payload, which the stored procedures process into Redshift tables.
DynamoDB can stream multiple tables into a single Kinesis Data Stream, so Redshift processes one stream to apply all changes, and there is no need to write custom parsing procedures for each table.
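For orientation, the snippet below is a minimal sketch of the kind of external schema and streaming materialized view that the setup procedure described later (sp_ddb_to_redshift_setup_schema_mv) creates; the schema name, view name, and role ARN here are illustrative placeholders, not necessarily the names the procedure uses.

-- Sketch only: streaming ingestion objects similar to what the setup procedure creates.
-- Names (kds_schema, ddb_cdc_mv) and the role ARN are illustrative placeholders.
CREATE EXTERNAL SCHEMA kds_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456781234:role/my_streaming_role';

CREATE MATERIALIZED VIEW ddb_cdc_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_PARSE(kinesis_data) AS payload   -- raw DynamoDB CDC record as SUPER
FROM kds_schema.redshift_cdc;                -- stream name from the prerequisites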
Prerequisites
- Create a Kinesis Data Stream. (Example name: redshift_cdc)
- Create DynamoDB table(s) and enable data streaming to the Kinesis Data Stream created above.
- Create an IAM role with the ability to read KDS streams and attach it to the Redshift cluster. (Sample role details are below. Example name: my_streaming_role)
The stored procedure and instructions can be found on GitHub at the location below. https://github.com/aws-samples/amazon-redshift-udfs/tree/master/stored-procedures/sp_cdc_DynamoDB_to_Redshift
Setup
Create the stored procedures required on the Redshift cluster, either in Query Editor v2 (QEv2) or via any SQL tool, using the sp_cdc_DynamoDB_to_Redshift.sql file. Executing this file should create the routines below (a query to confirm they were created is shown after the list).
- sp_ddb_to_redshift_setup_schema_mv(varchar,varchar,varchar). Setup routine to create materialized view and schema.
- sp_ddb_to_redshift_setup_process_tables(). Setup routine to create tables needed for ongoing replication process.
- sp_create_table_varchar_max(varchar). Routine to create table if it does not exist.
- sp_cursor_loop_alter_tables(). Routine to alter table if schema changes are detected.
- sp_cursor_loop_create_tables(). Routine to create tables when multiple new tables appear in the CDC stream.
- sp_cursor_loop_process_merge_tables(). Routine to merge data to target Redshift table.
- sp_merge_table_key_data(). Routine to handle updates.
- sp_delete_table_key_data(). Routine to handle deleted records.
- sp_ddb_to_redshift_incremental_refresh_cdc(). Main routine to execute on demand or schedule.
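To confirm the routines exist, a catalog query such as the following can be used. This is a sketch that assumes the procedures were created in the public schema; adjust the schema filter if you created them elsewhere.

-- Sketch: list the CDC stored procedures created by sp_cdc_DynamoDB_to_Redshift.sql.
SELECT n.nspname AS schema_name,
       p.proname AS procedure_name
FROM pg_proc p
JOIN pg_namespace n ON p.pronamespace = n.oid
WHERE n.nspname = 'public'
  AND p.proname LIKE 'sp_%'
ORDER BY p.proname;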
Execute setup routines
- Execute the procedure below to create the materialized view and schema, replacing the IAM role, account number, and KDS name with your own values:
call public.sp_ddb_to_redshift_setup_schema_mv('my_streaming_role','123456781234','redshift_cdc');
This will create the necessary schema and materialized view to capture data from DynamoDB.
- Execute the procedure below to create the tables needed for the replication process:
call public.sp_ddb_to_redshift_setup_process_tables();
Verify that the tables listed in the procedure were created by refreshing the schema (see the query sketch below).
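One way to verify, assuming the process tables are created in the public schema (adjust the schema filter if the procedure uses a different one):

-- Sketch: list tables created by the setup procedure.
SELECT table_schema, table_name
FROM svv_tables
WHERE table_schema = 'public'
  AND table_type = 'TABLE'
ORDER BY table_name;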
Ongoing process
Execute the procedure below on demand or on a schedule:
call public.sp_ddb_to_redshift_incremental_refresh_cdc();
Verify the tables and data by running a query on the Redshift cluster against the tables being replicated; they should contain the data captured since the DynamoDB tables started streaming into the Kinesis Data Stream (a verification sketch is shown below). Refer to the query scheduling process in Redshift Query Editor v2 for query scheduling steps.
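A quick check, using a hypothetical replicated table named customer_orders and the illustrative materialized view name used earlier:

-- Sketch: confirm that CDC rows have landed in a target table (table name is hypothetical).
SELECT COUNT(*) FROM public.customer_orders;

-- Sketch: confirm the stream is still delivering fresh records to the materialized view.
SELECT MAX(approximate_arrival_timestamp) AS latest_record
FROM ddb_cdc_mv;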
Notes:
- This handles CDC only for an existing table. In a production scenario, you may want to do a full copy/load via S3 first and then use this process for ongoing changes.
- For a new table, this can capture data from the initial changes if the process is configured before new data arrives.
- You can run CDC with multiple DynamoDB source tables pointing to the same Kinesis Data Stream.
- A new column (dist_Key) is added to the target table to track keys and distribution.
- The target data table has all columns defined and stored as varchar to accommodate any future changes to attributes.
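Because the target columns are all varchar, downstream consumers typically cast them where types matter. A minimal sketch, using a hypothetical replicated table customer_orders with columns order_id and order_total:

-- Sketch: typed view over a varchar-only CDC target table (table and column names are hypothetical).
CREATE VIEW public.customer_orders_typed AS
SELECT order_id::BIGINT            AS order_id,
       order_total::DECIMAL(12,2)  AS order_total
FROM public.customer_orders;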
FAQ
Can this process be used for non-DynamoDB JSON sources?
- Not at this time, but the framework is reusable for other JSON data ingestion. This requires extending parts of the process to handle other JSON formats.
What happens when multiple updates happen to a single key item within the refresh window?
- Only the last image of the record is processed; other duplicate CDC entries are ignored (an illustrative pattern is sketched below).
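For illustration only, this is a common way to keep just the latest image per key; it is not necessarily how the shipped procedures implement it, and the staging table and column names are assumptions.

-- Sketch: keep only the most recent CDC image per key within a refresh window.
WITH ranked AS (
    SELECT table_name,
           item_key,
           event_ts,
           payload,
           ROW_NUMBER() OVER (PARTITION BY table_name, item_key
                              ORDER BY event_ts DESC) AS rn
    FROM cdc_staging            -- hypothetical staging table of parsed CDC records
)
SELECT table_name, item_key, payload
FROM ranked
WHERE rn = 1;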
Can this process run CDC for multiple table sources?
- Yes, the process will handle this as long as the multiple DynamoDB tables are streaming changes to the same Kinesis Data Stream.
How is data stored in the materialized view?
- Data is stored in a SUPER data type column in JSON format (see the query sketch below).
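A sketch of querying that SUPER column with PartiQL-style navigation; the view name (ddb_cdc_mv) and payload column are the illustrative names used earlier, and the field names come from the DynamoDB-to-Kinesis record format.

-- Sketch: navigate the SUPER payload. Case-sensitive navigation is needed because the
-- DynamoDB record keys (tableName, eventName, NewImage) are mixed case.
SET enable_case_sensitive_identifier TO true;

SELECT payload."tableName"            AS source_table,
       payload."eventName"            AS cdc_event,
       payload."dynamodb"."NewImage"  AS new_image
FROM ddb_cdc_mv
LIMIT 10;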
What happens when a source table column is renamed?
- The change record is processed as if a new column was added. The target table in Redshift will have both columns, and the old column will have NULL values.
Sample IAM role policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadStream",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStreamSummary",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:DescribeStream"
            ],
            "Resource": "arn:aws:kinesis:*:123443211234:stream/*"
        },
        {
            "Sid": "ListStream",
            "Effect": "Allow",
            "Action": [
                "kinesis:ListStreams",
                "kinesis:ListShards"
            ],
            "Resource": "*"
        }
    ]
}
Observations from CDC performance testing between DynamoDB and Redshift
- Not all records showed up in the KDS stream / materialized view right away; sometimes it took two iterations. It is important to design the KDS for throughput and latency.
- Processing a single record update takes about as long as processing a group of records.
- Allowing a minute for the KDS to catch up before running this process is recommended.
- Update, delete, and new record changes run in the same time window; there is no considerable difference in processing time between the types of CDC image.
- The initial execution is slower than subsequent runs. This should normalize if the process runs on a 15-minute schedule.
- The process is better suited for near-real-time use cases with a 15-minute window.
| Scenario | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
|---|---|---|---|---|
| Multiple tables pushing data to a single stream (5 tables) | 38 sec | 15 sec | 43 sec | 29 sec |
| Number of updates processed in one iteration (200) | 65 sec | 14 sec | 28 sec | 19 sec |
| Single record updated/added | 20 sec | 3 sec | 9 sec | 28 sec |
| Number of inserts processed in one iteration (200) | 9 sec | 31 sec | 25 sec | 12 sec |