Automated change data capture (CDC) data ingestion from DynamoDB to Redshift

CDC changes from DynamoDB table to Redshift table

Introduction

The solution enables change data capture (CDC) replication between DynamoDB and Redshift. This process makes it easier to replicate multiple tables without having to code replication for each table. Schema changes are detected and handled in the Redshift process to keep it in sync with DynamoDB. The solution also provides the ability to control the CDC workload for multiple tables through scheduling, for better cluster resource management.

At a high level, the process involves the following activities:

  1. Create a Kinesis data stream (KDS) and enable streaming to it for the DynamoDB tables to be replicated.
  2. Create a materialized view in your Amazon Redshift cluster to consume streaming data from the Kinesis data stream.
  3. Create and execute stored procedures in Redshift. The streaming data is ingested as a JSON payload, which a stored procedure processes into Redshift tables.

DynamoDB can stream multiple tables into a single stream, so there is only one stream for Redshift to process when applying changes. On the Redshift side, there is no need to write custom parsing procedures for each table.
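
For context, the materialized view created over the stream is conceptually similar to the sketch below. This is only a sketch: the external schema name, view name, and column list are assumptions, and the actual DDL is generated by sp_ddb_to_redshift_setup_schema_mv; the IAM role, account number, and stream name follow the examples used in this article.

-- Minimal sketch of a Kinesis-backed external schema and materialized view
-- (illustrative only; the setup procedure generates the real definitions).
CREATE EXTERNAL SCHEMA kds_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456781234:role/my_streaming_role';

CREATE MATERIALIZED VIEW ddb_cdc_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_PARSE(kinesis_data) AS payload  -- DynamoDB CDC record as SUPER (JSON)
FROM kds_schema."redshift_cdc";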

Prerequisites

  1. Create a Kinesis data stream. (Example name: redshift_cdc)
  2. Create DynamoDB table(s) and enable data streaming to the Kinesis data stream created above.
  3. Create an IAM role with permission to read from the Kinesis data stream and attach it to the Redshift cluster. (Sample role details are below. Example name: my_streaming_role)

The stored procedure and setup instructions can be found on GitHub: https://github.com/aws-samples/amazon-redshift-udfs/tree/master/stored-procedures/sp_cdc_DynamoDB_to_Redshift

Setup

Create the stored procedures required on the Redshift cluster, either in Query Editor v2 (QEv2) or via any SQL tool, using the sp_cdc_DynamoDB_to_Redshift.sql file. Executing this file should create the routines below.

  1. sp_ddb_to_redshift_setup_schema_mv(varchar,varchar,varchar). Setup routine to create materialized view and schema.
  2. sp_ddb_to_redshift_setup_process_tables(). Setup routine to create tables needed for ongoing replication process.
  3. sp_create_table_varchar_max(varchar). Routine to create table if it does not exist.
  4. sp_cursor_loop_alter_tables(). Routine to alter table if schema changes are detected.
  5. sp_cursor_loop_create_tables(). Routine to create table when multiple new tables are in CDC.
  6. sp_cursor_loop_process_merge_tables(). Routine to merge data to target Redshift table.
  7. sp_merge_table_key_data(). Routine to handle updates.
  8. sp_delete_table_key_data(). Routine to handle deleted records.
  9. sp_ddb_to_redshift_incremental_refresh_cdc(). Main routine to execute on demand or schedule.

Execute setup routines

  1. Execute the procedure below to create the materialized view and schema, replacing the parameters with your IAM role, account number, and KDS name.

    call public.sp_ddb_to_redshift_setup_schema_mv('my_streaming_role','123456781234','redshift_cdc');

    This will create the necessary schema and materialized view to capture data from DynamoDB.

  2. Execute the procedure below to create the tables needed for the replication process.

    call public.sp_ddb_to_redshift_setup_process_tables();

    Verify that the tables listed in the procedure were created by refreshing the schema (a verification sketch follows below).
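
As a quick check after these two steps, catalog queries such as the sketch below can confirm that the materialized view and the process tables exist. STV_MV_INFO is a Redshift system view; the public schema used in the second query is an assumption.

-- Confirm the materialized view exists and whether it is stale.
SELECT db_name, schema, name, is_stale, autorefresh
FROM stv_mv_info;

-- Confirm the process/control tables were created (schema is an assumption).
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;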

Ongoing process

Execute the procedure below on demand or on a schedule:

call public.sp_ddb_to_redshift_incremental_refresh_cdc();

Verify the tables and data by running a query on the Redshift cluster against the replicated tables. They should contain the data captured since the DynamoDB tables started streaming into the Kinesis data stream. Refer to the query scheduling process in Redshift Query Editor v2 for scheduling steps.
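
For example, a quick spot check against one of the replicated tables might look like the sketch below, where orders is a hypothetical DynamoDB source table; substitute one of your own tables.

-- 'orders' is a hypothetical replicated table name.
SELECT COUNT(*) AS replicated_rows FROM public.orders;
SELECT * FROM public.orders LIMIT 10;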

Notes:

  1. This process provides CDC only for an existing table. In a production scenario, you may want to do a full copy/load via S3 (see the sketch after these notes) and then use this process for ongoing changes.
  2. For a new table, this can capture data from the initial changes if the process is configured before new data arrives.
  3. You can run CDC with multiple DynamoDB source tables pointing to the same Kinesis data stream.
  4. A new column (dist_Key) is added to the target table to track keys and distribution.
  5. The target data table has all columns defined and stored as varchar to accommodate any future changes to attributes.
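
For the full copy/load mentioned in note 1, a minimal sketch might look like the following, assuming the source data has been staged in S3 as newline-delimited JSON and the target table already exists. The bucket, prefix, table name, and role ARN are all illustrative.

-- Illustrative one-time full load from S3 before switching to ongoing CDC.
COPY public.orders
FROM 's3://my-bucket/full-load/orders/'
IAM_ROLE 'arn:aws:iam::123456781234:role/my_streaming_role'
JSON 'auto';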

FAQ

Can this process be used for non-DynamoDB JSON sources?

  • Not at this time, but the framework is reusable for other JSON data ingestion. This would require extending parts of the process to handle other JSON formats.

What happens when multiple updates happen to a single key item within the refresh window?

  • Only the last image of the record is processed; other duplicate CDC entries are ignored (see the sketch below).
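
Conceptually, the deduplication is similar to the sketch below, which keeps only the most recent image per Kinesis partition key; the view and column names are assumptions consistent with the earlier sketch, and the actual logic lives in the stored procedures.

-- Keep only the most recent CDC image per partition key within the batch.
SELECT payload
FROM (
    SELECT payload,
           ROW_NUMBER() OVER (
               PARTITION BY partition_key
               ORDER BY approximate_arrival_timestamp DESC, sequence_number DESC
           ) AS rn
    FROM ddb_cdc_mv
) latest
WHERE rn = 1;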

Can this process run CDC for multiple table sources?

  • Yes, as long as the multiple DynamoDB tables are streaming changes to the same Kinesis data stream.

How is data stored in the materialized view?

  • Data is stored in a SUPER data type column in JSON format (see the example query below).
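
For example, a query like the sketch below navigates the SUPER payload of a DynamoDB CDC record. The view name is an assumption consistent with the earlier sketch; attribute names such as tableName, eventName, and NewImage come from the DynamoDB-to-Kinesis record format.

-- Case-sensitive identifiers are needed to navigate the camelCase JSON attributes.
SET enable_case_sensitive_identifier TO true;

SELECT payload."tableName",
       payload."eventName",
       payload."dynamodb"."NewImage"
FROM ddb_cdc_mv
ORDER BY approximate_arrival_timestamp DESC
LIMIT 10;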

What happens when a source table column is renamed?

  • The change record is processed as if a new column were added, as sketched below. The target table in Redshift will have both columns, and the old column will have NULL values.
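
In effect, the schema-change handling behaves like the hypothetical statement below: the renamed column is treated as a new varchar(max) column added to the target table, while the original column remains in place. Table and column names are illustrative.

-- Hypothetical effect of a source column rename (customer_name -> customer_full_name):
-- the new name is simply added as another column on the target table.
ALTER TABLE public.orders ADD COLUMN customer_full_name varchar(max);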

Sample IAM role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadStream",
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStreamSummary",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:DescribeStream"
      ],
      "Resource": "arn:aws:kinesis:*:123443211234:stream/*"
    },
    {
      "Sid": "ListStream",
      "Effect": "Allow",
      "Action": [
        "kinesis:ListStreams",
        "kinesis:ListShards"
      ],
      "Resource": "*"
    }
  ]
}

Observations from CDC performance testing between DynamoDB and Redshift

  1. Not all records showed up in the KDS stream / materialized view right away; sometimes it took two iterations. It is important to design the KDS for throughput and latency.
  2. Processing a single record update is not much different from processing a group of records.
  3. Allowing a minute for the KDS to catch up before running this process is recommended.
  4. Update, delete, and new record changes run in the same time window, with no considerable difference in processing time between types of CDC image.
  5. The initial execution is slower than subsequent runs. This should normalize if the process runs on a 15-minute schedule.
  6. This approach is better suited for near-real-time use cases with a 15-minute window.

Scenario                                                     Iteration 1   Iteration 2   Iteration 3   Iteration 4
Multiple tables pushing data to a single stream (5 tables)   38 sec        15 sec        43 sec        29 sec
Number of updates processed in one iteration (200)           65 sec        14 sec        28 sec        19 sec
Single record updated/added                                  20 sec        3 sec         9 sec         28 sec
Number of inserts processed in one iteration (200)           9 sec         31 sec        25 sec        12 sec