- Newest
- Most votes
- Most comments
Amazon Data Firehose uses snapshot watermark tables to track the progress of incremental data copying from source databases, allowing it to resume operations from where it stopped rather than restarting the entire process.
When you set up a Firehose stream with a database source, a watermark table is required in your source database. This table stores markers that help Firehose identify which parts of your database tables have already been captured and which parts still need processing. This is particularly useful during the initial snapshot phase of data replication.
The watermark table functions as follows:
- It tracks the progress of incremental snapshots of your database tables
- If there are any interruptions in the data copying process, Firehose can use these watermarks to resume from where it left off
- This approach prevents the need to recapture the entire table when interruptions occur
You have two options for creating this watermark table:
- Let Firehose automatically create it during stream creation (you'll need to provide database credentials with table creation permissions)
- Create it manually before setting up the Firehose stream
The watermark table has a specific structure with three columns:
- id (varchar(64)) - Primary key
- type (varchar(32))
- data (varchar(2048))
For MySQL databases, there's a provided SQL procedure called setupFirehose that can create the watermark table, set up a database user for Firehose access, and grant the necessary permissions.
While Firehose continuously delivers data from the source to the destination once configured, the watermark table specifically helps with tracking progress during initial data loading and handling interruptions, ensuring efficient resumption of operations without requiring a full reload of data.
Sources
Prerequisites to use database as a source - Amazon Data Firehose
Replicate changes from databases to Apache Iceberg tables using Amazon Data Firehose (in preview) | AWS News Blog
MySQL - RDS, Aurora and self-managed databases running on Amazon EC2 - Amazon Data Firehose
answered a year ago
Relevant content
asked a year ago
asked a year ago
asked 9 months ago
- AWS OFFICIALUpdated a year ago

My specific questions: How exactly does Firehose use snapshot watermark tables to resume from where it left off after a stream restart? Is there a documented process or best practice for manually updating or inserting into the watermark table (e.g. to skip reloading already-processed data)? Can anyone provide guidance on the schema of the watermark table and what values are required to ensure consistent resumption?