Using CDC data - Insert/Update/Delete

0

Hello,

I have Change Capture Data landing into a S3 bucket. The CDC data is from an AWS RDS. This is my starting point.

Now, let's say, I have tables taba, tabb, tabc. I have the 3 folders for each table and all the CDC data is segregated per table in these folders. Now I must create another bucket, that will only hold the latest data.

so in taba folder, if I had rows with 10 inserts , 10 updates and 3 deletes. How do I bring it to a state of finalization in the new bucket, meaning I should only have the latest data?

2 Answers
1

Did you look into DMS for migrating CDC data into AWS S3? In case your RDS source instance is in one of these: PostgresSQL, MySQL, Oracle or SQL Server, the native CDC of the database is supported and can handle ongoing replication between your RDS database and S3.

Building your own logic would be tedious and would work differently for different databases. If you use is to really start from insert, update and delete data and it has a timestamp column in common that identifies the order of operations, you could use Glue to build out the logic using Python or Pyspark. Each database has a different CDC process and your Glue program would have to reflect the data available in those datasets.

Reference about DMS CDC: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html

profile pictureAWS
answered 2 years ago
  • Thanks ananthtm. Let me explain better, that data that has arrived into S3 is from Kafka and the data in S3 looks like CDC data, with I,U and D flags. I want to now convert this data, CDC data, into regular data.

    So for example, the following is present in S3.

    I, 100, 'John',1000 I,101,'Sam',2000 U,100,'John',1500

    I want to convert this into,

    100,'John',1500 101,'Sam',2000

    And I want the above in another bucket

0

As far as I know, this needs a custom solution.

Let's acknowledge that Amazon S3 is not a database, hence what you are trying to achieve would be not possible with AWS DMS (or any/most of replication tool for that matter). Even if you use AWS S3 as Source with DMS, the eventual data on the target i.e. S3 will still have the DML operation types like I,U,D.

I am not sure if the use case, but the core requirement is to NOT have the transactions listed as is and just store the volatile data, then you have to look at any database as target, instead of AWS S3 with AWS DMS.

Or, you should look at any custom solution which could read the data from S3 and consolidate on another S3 bucket as per the requirement.

AWS
D-Rao
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions