Skip to content

How does Amazon Data Firehose use snapshot watermark tables for resuming database replication without full reload?

0

Hi all,

I'm working with Amazon Data Firehose for near real-time database replication into Apache Iceberg on S3, based on the recent blog post. My source is PostgreSQL, and I'm trying to better understand how Firehose handles snapshot watermark tables during restarts.

From my understanding, these snapshot watermark tables play a role similar to how AWS DMS tracks replication progress—potentially allowing Firehose to avoid redoing a full snapshot after a restart. However, the documentation isn’t clear on how this works in practice.

My specific questions:

  1. How exactly does Firehose use snapshot watermark tables to resume from where it left off after a stream restart?
  2. Is there a documented process or best practice for manually updating or inserting into the watermark table (e.g. to skip reloading already-processed data)?
  3. Can anyone provide guidance on the schema of the watermark table and what values are required to ensure consistent resumption?

The goal is to avoid reprocessing the full dataset in cases where Firehose needs to be restarted or redeployed.

Any insight, or pointers to relevant documentation would be much appreciated.

asked a year ago193 views

1 Answer
0

Amazon Data Firehose uses snapshot watermark tables to track the progress of incremental data copying from source databases, allowing it to resume operations from where it stopped rather than restarting the entire process.

When you set up a Firehose stream with a database source, a watermark table is required in your source database. This table stores markers that help Firehose identify which parts of your database tables have already been captured and which parts still need processing. This is particularly useful during the initial snapshot phase of data replication.

The watermark table functions as follows:

  • It tracks the progress of incremental snapshots of your database tables
  • If there are any interruptions in the data copying process, Firehose can use these watermarks to resume from where it left off
  • This approach prevents the need to recapture the entire table when interruptions occur

You have two options for creating this watermark table:

  1. Let Firehose automatically create it during stream creation (you'll need to provide database credentials with table creation permissions)
  2. Create it manually before setting up the Firehose stream

The watermark table has a specific structure with three columns:

  • id (varchar(64)) - Primary key
  • type (varchar(32))
  • data (varchar(2048))

For MySQL databases, there's a provided SQL procedure called setupFirehose that can create the watermark table, set up a database user for Firehose access, and grant the necessary permissions.

While Firehose continuously delivers data from the source to the destination once configured, the watermark table specifically helps with tracking progress during initial data loading and handling interruptions, ensuring efficient resumption of operations without requiring a full reload of data.
Sources
Prerequisites to use database as a source - Amazon Data Firehose
Replicate changes from databases to Apache Iceberg tables using Amazon Data Firehose (in preview) | AWS News Blog
MySQL - RDS, Aurora and self-managed databases running on Amazon EC2 - Amazon Data Firehose

answered a year ago

  • My specific questions: How exactly does Firehose use snapshot watermark tables to resume from where it left off after a stream restart? Is there a documented process or best practice for manually updating or inserting into the watermark table (e.g. to skip reloading already-processed data)? Can anyone provide guidance on the schema of the watermark table and what values are required to ensure consistent resumption?

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.