
CDC Stream Configuration and Best Practices for Managed Apache Cassandra Service

7 minute read
Content level: Intermediate

Amazon Keyspaces CDC enables real-time tracking of row-level changes (inserts, updates, and deletes) using a pull-based model. Customers retrieve changes from CDC endpoints using the Amazon Keyspaces Streams Kinesis Adapter. CDC supports use cases like analytics, auditing, and replication, and integrating with services like Kinesis Data Streams requires building custom consumer applications.

Amazon Keyspaces (for Apache Cassandra) provides Change Data Capture (CDC) functionality that enables real-time capture of data modifications at the row level. CDC in Keyspaces is a pull-based model where change data is made available through CDC endpoints, and customers are responsible for actively retrieving the changes by utilizing the Keyspaces KCL Adaptor. CDC streams allow you to track inserts, updates, and deletes in your Keyspaces tables and process these changes for downstream applications such as data analytics, auditing, replication, and event-driven architectures. To integrate with services like Kinesis Data Streams, customers must build consumer applications that pull data from Keyspaces CDC endpoints. This article provides a comprehensive guide to configuring CDC streams, implementing optimal stream processing architectures, optimizing costs, and troubleshooting common CDC data retrieval issues using CLI commands and SDK implementations.

What is Change Data Capture (CDC) in Amazon Keyspaces?

Amazon Keyspaces CDC provides the ability for client applications to consume row-level change events in near-real time. Amazon Keyspaces CDC enables event-driven use cases such as industrial IoT and fraud detection, as well as data processing use cases like full-text search and data archival. The change events that Amazon Keyspaces CDC captures in streams can be consumed by downstream applications that perform business-critical functions such as data analytics, text search, ML training/inference, and continuous data backups for archival. For example, you can transfer stream data to AWS analytics and storage services like Amazon OpenSearch Service, Amazon Redshift, and Amazon S3 for further processing.

Amazon Keyspaces CDC offers time-ordered and de-duplicated change records for tables, with automatic scaling of data throughput and retention time of up to 24 hours. Amazon Keyspaces CDC streams are completely serverless, and you don't need to manage the data infrastructure for capturing change events. In addition, Amazon Keyspaces CDC doesn't consume any table capacity for either compute or storage.

Best practices for stream processing architectures

Use Kinesis Client Library (KCL) for Stream Processing

Designing an efficient and resilient stream processing architecture is critical to successfully implementing CDC with Amazon Keyspaces. The architecture you choose affects reliability, scalability, latency, and operational costs. One of the most important decisions is how you consume and process CDC data. Working with the Kinesis Client Library (KCL) instead of calling the Amazon Keyspaces Streams API directly provides many benefits, for example:

  • Built-in shard lineage tracking and iterator handling
  • Automatic load balancing across workers
  • Fault tolerance and recovery from worker failures
  • Checkpointing to track processing progress
  • Adaptation to changes in stream capacity
  • Simplified distributed computing for processing CDC records

KCL handles many of the complex tasks associated with distributed computing, letting you focus on implementing your business logic when processing stream data. It provides useful abstractions above the low-level Kinesis Data Streams API, making it easier to develop robust stream processing applications.

Using Amazon Keyspaces Streams Kinesis Adapter

To write applications using the KCL with Amazon Keyspaces, you must use the Amazon Keyspaces Streams Kinesis Adapter. The Kinesis Adapter implements the Kinesis Data Streams interface, allowing you to use the KCL for consuming and processing records from Amazon Keyspaces streams. To integrate the adapter, applications can either build it from the source available on GitHub (https://github.com/aws/keyspaces-streams-kinesis-adapter) or add the dependency from the Maven Central repository.

Shard Iterator Types

Amazon Keyspaces CDC streams support four shard iterator types that determine where consumption begins. TRIM_HORIZON starts reading from the oldest available record in the shard, making it suitable for processing historical changes or initializing a new consumer that needs complete data history. LATEST begins reading from new records only, skipping existing data in the stream. Choose TRIM_HORIZON when you need to process backlogged or historical data, and LATEST when your application only requires near real-time processing of ongoing changes. AT_SEQUENCE_NUMBER starts reading from a specific sequence number, while AFTER_SEQUENCE_NUMBER begins reading immediately after a specified sequence number. AT_SEQUENCE_NUMBER and AFTER_SEQUENCE_NUMBER are useful for resuming consumption from a known checkpoint position.
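As an illustration, the iterator choice above can be encoded as a small helper that builds request parameters from consumer state. The field names (`ShardIteratorType`, `SequenceNumber`) are assumptions mirroring the Kinesis-style API shape; treat this as a sketch rather than the exact Keyspaces request format.

```python
def shard_iterator_request(stream_arn, shard_id, checkpoint=None, replay_history=False):
    """Build hypothetical GetShardIterator parameters from consumer state.

    - checkpoint set      -> AFTER_SEQUENCE_NUMBER (resume after last processed record)
    - replay_history True -> TRIM_HORIZON (process the full retained history)
    - otherwise           -> LATEST (new records only)
    """
    params = {"StreamArn": stream_arn, "ShardId": shard_id}
    if checkpoint is not None:
        # Resume immediately after the last record the application processed.
        params["ShardIteratorType"] = "AFTER_SEQUENCE_NUMBER"
        params["SequenceNumber"] = checkpoint
    elif replay_history:
        # Start from the oldest available record (retention is up to ~24 hours).
        params["ShardIteratorType"] = "TRIM_HORIZON"
    else:
        # Skip existing data and read only records written from now on.
        params["ShardIteratorType"] = "LATEST"
    return params
```

This keeps the "which iterator type do I need?" decision in one place, which makes restart behavior easy to test.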

Building Custom CDC Consumer Applications

Consuming Keyspaces CDC streams using the APIs directly requires a continuously running application that maintains persistent connections and manages shard iterators efficiently. Long-running compute environments are well-suited for this workload because they run indefinitely without execution time constraints, maintain state in memory, and handle high-volume streams reliably.

When building custom consumer applications, implement these critical capabilities:

State Tracking: Applications must track and persist the sequence number up to which they have consumed each shard. This ensures records are not processed twice when the application restarts.

Lineage Tracking: Applications must consume records in shard-lineage order (the parent-child graph that can be built from the APIs), similar to what KCL does. Without this logic, ordering guarantees are lost.
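The lineage requirement amounts to a topological sort of the shard graph: every parent shard must be fully processed before its children. The sketch below assumes shard metadata shaped as dicts with `shardId` and optional `parentShardIds` keys (an assumption about the field names returned by the stream-description APIs):

```python
from collections import deque


def lineage_order(shards):
    """Return shard IDs ordered so that every parent precedes its children
    (Kahn's topological sort over the parent-child shard graph)."""
    ids = {s["shardId"] for s in shards}
    children = {s["shardId"]: [] for s in shards}
    indegree = {s["shardId"]: 0 for s in shards}
    for s in shards:
        for parent in s.get("parentShardIds", []):
            if parent in ids:  # a parent may already be trimmed from the stream
                children[parent].append(s["shardId"])
                indegree[s["shardId"]] += 1
    # Start with shards that have no (retained) parent.
    queue = deque(sorted(i for i, d in indegree.items() if d == 0))
    ordered = []
    while queue:
        sid = queue.popleft()
        ordered.append(sid)
        for child in children[sid]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return ordered
```

Processing shards in this order is what preserves per-partition-key ordering across shard splits and merges.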

Troubleshooting common CDC data retrieval issues using CLI commands and SDK implementations

1. Latest data not appearing in your Amazon Keyspaces CDC (Change Data Capture) stream

There are several potential causes for CDC data not appearing or being delayed in your Keyspaces stream. Here are a few general guidance points:

  • Propagation to the CDC stream is asynchronous, so some delay is expected.

  • The consuming application may be processing change events more slowly than writes arrive at the table. This causes the iterator to fall behind the tip of the stream (the latest events).

  • Verify that you have the necessary IAM permissions for working with Keyspaces CDC streams:

  • "cassandra:Select" (on the source table)
  • "cassandra:ListStreams"
  • "cassandra:GetStream"
  • "cassandra:GetShardIterator"
  • "cassandra:GetRecords"
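For reference, the actions above can be combined into a single policy statement like the following (in practice, scope Resource to your table and stream ARNs; "*" is used here only for brevity):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cassandra:Select",
        "cassandra:ListStreams",
        "cassandra:GetStream",
        "cassandra:GetShardIterator",
        "cassandra:GetRecords"
      ],
      "Resource": "*"
    }
  ]
}
```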

To diagnose these issues more systematically, do the following:

  1. Verify CDC is enabled and check the stream status
  2. Confirm recent write operations to your Keyspaces table
  3. Check CloudWatch metrics for your CDC stream to confirm records are being written
  4. Verify your shard iterator and reading position in the stream

2. Understanding Duplicate Record Retrieval

Keyspaces streams are pull-based, meaning reading a record does not remove it from the stream. When using TRIM_HORIZON, the consumer reads from the oldest untrimmed record, and since records remain for approximately 24 hours before expiration, re-running the consumer without proper checkpointing retrieves the same records again. This behavior reinforces why stream consumption typically uses continuously running applications with the LATEST iterator type rather than repeated executions from TRIM_HORIZON.

Keyspaces streams are time-ordered and de-duplicated at the shard level. If you observe apparent duplicate records, they are not duplicate stream emissions: Keyspaces ensures each change generates a single stream record. Rather, duplicates result from your application re-reading or re-processing the same stream records multiple times due to a lack of checkpoint management.

Why does incomplete data retrieval occur?

Incomplete retrieval in Keyspaces streams occurs when the logic misinterprets empty batches as the end of available data. Developers commonly implement loops that break upon receiving an empty response, assuming no more records exist. However, empty batches are normal in distributed streaming systems and can result from temporal gaps between write batches, internal system coordination, CDC processing delays, or shard-level synchronization. The actual end-of-shard signal is a null or absent nextShardIterator field in the API response, not an empty record set. When code terminates on empty batches instead of checking iterator validity, it stops prematurely while additional records remain accessible further in the stream.
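The correct loop shape can be sketched with a stub client: keep reading through empty batches and stop only when the next iterator is null or absent. The `Records`/`NextShardIterator` field names are assumptions mirroring the Kinesis-style response shape.

```python
def drain_shard(client, iterator):
    """Read a shard to its true end. An empty 'Records' batch does NOT mean
    the shard is done; only a null/absent next iterator does."""
    records = []
    while iterator is not None:
        resp = client.get_records(iterator)
        # Empty batches are normal in distributed streams: keep going.
        records.extend(resp.get("Records", []))
        iterator = resp.get("NextShardIterator")  # None => shard is closed
    return records


class FakeClient:
    """Stub that simulates a shard with an empty batch in the middle.
    A loop that breaks on the first empty batch would miss record 3."""

    def __init__(self):
        self.batches = {
            "it0": {"Records": [1, 2], "NextShardIterator": "it1"},
            "it1": {"Records": [], "NextShardIterator": "it2"},   # empty, not done
            "it2": {"Records": [3], "NextShardIterator": None},   # shard closed
        }

    def get_records(self, iterator):
        return self.batches[iterator]
```

With real streams, add a short sleep between calls when batches come back empty, so the loop does not busy-poll an open shard.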



Cost optimization strategy

Optimize GetRecords polling frequency

Reduce API costs by adjusting your GetRecords polling frequency based on your workload requirements. Each GetRecords call incurs charges, so frequent polling during periods of low or no CDC activity can lead to unnecessary expenses.
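One simple way to do this, sketched below as an assumption rather than a prescribed pattern, is an adaptive polling interval: poll at the minimum delay while records are flowing, and back off exponentially during idle periods to cut the number of billable GetRecords calls.

```python
def next_poll_delay(current_delay, got_records, min_delay=0.2, max_delay=10.0):
    """Return the delay (seconds) before the next GetRecords call.

    - Records were returned: poll again quickly to stay near the stream tip.
    - Empty batch: double the delay, capped at max_delay, to save API calls.
    """
    if got_records:
        return min_delay
    return min(current_delay * 2 if current_delay > 0 else min_delay, max_delay)
```

Tune `min_delay` for your latency target and `max_delay` for your cost target; the right values depend on how bursty the table's write traffic is.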

References:

[+] https://docs.aws.amazon.com/keyspaces/latest/devguide/keyspaces-enable-cdc-alter-table.html

[+] https://docs.aws.amazon.com/keyspaces/latest/devguide/keyspaces-records-cdc.html

[+] https://docs.aws.amazon.com/keyspaces/latest/devguide/cdc_how-to-use-kcl.html

[+] https://docs.aws.amazon.com/whitepapers/latest/cost-modeling-data-lakes/cost-optimization-in-analytics-services.html

AWS (Expert)
published 16 days ago