Preserving Transaction Order When Migrating Postgresql to S3 with DMS

0

We are currently in the process of migrating a Postgresql database from RDS to an S3 data lake. During our data migration, we noticed an important aspect related to transaction order preservation that we need to address.

Upon inspecting the CDC data stored as parquet files in S3, we discovered that multiple rows with the same identifier (id) are committed within the same transaction. This means that they share the same commit timestamp, making it challenging for us to identify the latest change for each id. The distinguishing factor among these rows is the 'modified_at' column, and there may be other relevant columns as well. Here's an example:

Opidcommit_timestampmodified_attest_col
I12023-10-11 08:45:002023-10-11 08:44:20a
U12023-10-11 08:45:002023-10-11 08:44:30b
U12023-10-11 08:45:002023-10-11 08:44:35c

Our goal is to retrieve the latest change for each id while considering all relevant columns. However, we can't rely solely on partitioning by id + commit_timestamp + modified_at because there may be other columns with timestamps that could influence the order, and we don't know which columns those might be.

We've observed that the data in the parquet files on S3 seems to be ordered as we expect in some cases, but we can't be entirely certain if this is always the case, given that we have many tables replicated to S3. We've manually inspected approximately 10 files, and they appear to be in order. We also noticed a setting called "PreserveTransactions" that can be used on the S3 target. However, according to the documentation, the results are stored in .csv transaction files, represented as operation, table_name, database_schema_name, field_value. Unfortunately, this doesn't address our specific issue because the transaction .csv files do not contain the id (PK) of the row, and we're unsure how to correlate this information with the parquet data.

Our key question is this: Is there a specific setting within Postgresql source, AWS Database Migration Service (DMS), or the S3 target that can guarantee the preservation of transaction order when migrating data from Postgresql to S3 in parquet format? Alternatively, is transaction order already preserved by default, and if so, how can we ensure this consistently across our migration process?

1 Answer
2

Hi,

the right metadata to use to get and preserve order is AR_H_CHANGE_SEQ

See https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.Expressions.html

AR_H_CHANGE_SEQ:  A unique incrementing number from the source database that consists of 
a timestamp and an auto incrementing number. The value depends on the source database system.

We use it for same purpose as you in one of our projects

Best,

Didier

profile pictureAWS
EXPERT
answered 7 months ago
  • Hi Didier, thank you for your comment.

    I'm wondering what the last 19 digits of AR_H_CHANGE_SEQ are? We're replicating a PostgreSQL database and we have pglogical turned on.

  • Hello everyone, I have the exact same issue. AR_H_CHANGE_SEQ is the way to go. But for some reason, if I add transformation rule of adding a new column to the existing task, rule gets ignored. Any thoughts on this?

    I am running full load + cdc on the 3.5.1 engine. PG as source, s3 as the target.

  • Hi milotr,

    This might be best for its own topic so we can gather more details from you and not derail the topic here. I would like to know if you are using pglogical or test_decoding in this task.

  • I tried that as well and I have encountered the same issue. We're using pglogical

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions