Best Practices for Write and Delete Performance in Amazon Keyspaces for Apache Cassandra
This article provides an optimization guide focused on improving write throughput and delete operation efficiency in Amazon Keyspaces. It covers batch writing strategies, efficient deletion patterns for large datasets, and tombstone management techniques.
Introduction:
Amazon Keyspaces is a fully managed, serverless database service that provides Apache Cassandra compatibility without the operational overhead. The service eliminates the need to provision, patch, or manage servers. This enables you to build applications that handle thousands of requests per second with virtually unlimited throughput and storage capacity. While Amazon Keyspaces automatically scales to accommodate your workload, optimizing write and delete operations is essential for achieving optimal performance. This article provides comprehensive guidance on best practices for write and delete performance in Amazon Keyspaces.
Best Practices for Write Performance:
1. Using Batch Statements:
Amazon Keyspaces supports two distinct types of batch operations, logged and unlogged batches, each serving different use cases and performance requirements[1].
Unlogged batches process multiple operations as a single request without maintaining a batch log. They are particularly efficient when operations are confined to a single partition and when reducing network traffic is a priority. While unlogged batches offer better performance, they come with the possibility that some operations within the batch might succeed while others fail. This makes them ideal for high-throughput applications where partial success is acceptable and maximum performance is required.
Logged batches, on the other hand, provide atomic guarantees by combining multiple write actions into a single operation. This means all actions in the batch either succeed together or fail together, making them ideal for scenarios requiring strong consistency. Logged batches support writes across multiple Amazon Keyspaces tables within the same AWS account and Region. While they offer stronger consistency guarantees, they typically have slightly higher latencies compared to unlogged batches. When planning capacity for logged batches, it's important to note that each row requires twice the write capacity of standard operations. Despite this increased capacity requirement, Amazon Keyspaces doesn't charge additional costs for using logged batches; you only pay for the actual writes performed.
Choose logged batches when atomic operations and cross-table consistency are crucial for your application. Opt for unlogged batches when performance is paramount and operations are primarily within a single partition. This balanced approach to batch operations allows you to optimize write performance while meeting your application's specific consistency and throughput requirements.
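The choice between the two batch types can be sketched as follows. This is a minimal illustration that only composes the CQL text; the keyspace, table, and column names (shop.orders, shop.order_events) are hypothetical, and sending the statements through a driver session is elided.

```python
# Sketch: composing logged vs. unlogged BATCH statements as CQL text.
# Table and column names are hypothetical examples.

def build_batch(statements, logged=True):
    """Wrap CQL write statements in a BEGIN ... APPLY BATCH block."""
    header = "BEGIN BATCH" if logged else "BEGIN UNLOGGED BATCH"
    body = "\n".join(f"  {s.rstrip(';')};" for s in statements)
    return f"{header}\n{body}\nAPPLY BATCH;"

# Unlogged batch: all writes target the same partition (order_id = 42),
# so partial success is confined to one partition and throughput is maximized.
unlogged = build_batch(
    [
        "INSERT INTO shop.order_events (order_id, seq, event) VALUES (42, 1, 'created')",
        "INSERT INTO shop.order_events (order_id, seq, event) VALUES (42, 2, 'paid')",
    ],
    logged=False,
)

# Logged batch: atomic writes across two tables in the same account and Region.
logged = build_batch(
    [
        "INSERT INTO shop.orders (order_id, status) VALUES (42, 'paid')",
        "INSERT INTO shop.order_events (order_id, seq, event) VALUES (42, 2, 'paid')",
    ]
)

print(unlogged.splitlines()[0])  # BEGIN UNLOGGED BATCH
print(logged.splitlines()[0])    # BEGIN BATCH
```

In practice you would send these through your driver's batch API rather than as raw strings; the point here is the structural difference between the two batch headers.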
When implementing batch operations in Amazon Keyspaces, following established best practices ensures optimal performance and reliability.
Enable automatic scaling for your tables to handle batch operations efficiently, particularly for logged batches which require additional throughput capacity. This proactive scaling approach helps prevent throttling and ensures consistent performance during peak workloads.
For operations that can run independently without affecting application correctness, use individual operations or unlogged batches. This approach offers better performance and resource utilization compared to logged batches. When dealing with high throughput bulk data ingestion where atomic guarantees aren't necessary, individual write operations or unlogged batches are the recommended choice.
Application design should minimize concurrent updates to the same rows. Simultaneous batch operations targeting identical rows can lead to conflicts and operation failures. Structure your write patterns to avoid these scenarios by implementing proper request distribution and timing strategies.
These practices, when implemented together, help maintain optimal performance while ensuring reliable data operations in your Amazon Keyspaces implementation. Regular monitoring and adjustment of these strategies based on your application's performance metrics will help maintain efficient batch operations over time.
2. Data Modeling:
Amazon Keyspaces logically partitions data based on your partition key, and each partition has a throughput limit of 1,000 Write Capacity Units per second. If your write workload heavily targets a single partition key, re-evaluate your access patterns, because Amazon Keyspaces doesn't support more than 1,000 writes per second to the same partition. To avoid hot partitions, ensure your partition keys have high cardinality with many distinct values that naturally distribute writes. If you must write heavy volumes to a single logical entity, consider "write sharding" by appending a random shard number (for example, 1-10) to the partition key[3].
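The write-sharding idea above can be sketched as a small helper. This is a minimal illustration, assuming 10 shards and a hypothetical logical key such as a device ID; note that readers must then query all shard suffixes to reassemble the logical entity.

```python
import random

NUM_SHARDS = 10  # assumption: 10 shards spread a hot key's writes across 10 partitions


def sharded_partition_key(logical_key, shard=None):
    """Append a shard suffix (1..NUM_SHARDS) to a hot logical key.

    Writers pass shard=None to scatter randomly; readers iterate shards 1..NUM_SHARDS.
    """
    if shard is None:
        shard = random.randint(1, NUM_SHARDS)
    return f"{logical_key}.{shard}"


# A write path picks a random shard, so inserts for 'device-7' land on
# partitions device-7.1 through device-7.10 instead of one hot partition.
key = sharded_partition_key("device-7")
print(key)
```

The trade-off is on the read side: a query for all of `device-7` becomes `NUM_SHARDS` queries (one per suffix), so pick a shard count just large enough to stay under the per-partition throughput limit.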
Best Practices for Delete Performance:
1. Time-To-Live (TTL) feature for Automatic Data Expiration:
Amazon Keyspaces Time to Live (TTL) helps you simplify application logic and optimize storage costs by automatically expiring and deleting data from tables based on a configured TTL value. This feature makes it easier to comply with data retention policies and regulatory requirements that define how long data must be retained or when it must be deleted. You can set a default TTL value for the entire table and override it for individual rows or columns, and TTL operations don't impact your application's performance or table availability. Amazon Keyspaces automatically filters out expired data from query results and typically deletes it from storage within 10 days of expiration, after which you stop incurring storage fees.[4]
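Both TTL levels described above can be sketched as CQL statements. The keyspace, table, and column names (iot.events) are hypothetical; the statements are composed as strings here only to show the shape of the syntax.

```python
# Sketch: a table-level default TTL plus a per-row override, as CQL text.
# Table and column names are hypothetical examples.

def ttl_seconds(days):
    """TTL values in CQL are expressed in seconds."""
    return days * 24 * 60 * 60


# Table-level default: every row expires 30 days after it is written.
create_stmt = (
    "CREATE TABLE iot.events (device_id text, ts timestamp, payload text, "
    f"PRIMARY KEY (device_id, ts)) WITH default_time_to_live = {ttl_seconds(30)};"
)

# Per-row override: this particular row expires after 1 day instead.
insert_stmt = (
    "INSERT INTO iot.events (device_id, ts, payload) "
    f"VALUES ('d-1', toTimestamp(now()), 'hello') USING TTL {ttl_seconds(1)};"
)

print(create_stmt)
print(insert_stmt)
```

Because expired rows are filtered from query results automatically and removed from storage within about 10 days, no application-side cleanup job is needed for either case.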
2. Avoid Wide Partition Deletes:
Deleting an entire partition with many rows generates numerous tombstones and can significantly impact performance. If you need to remove large amounts of data from a partition, consider your data model design. For scenarios requiring bulk deletion, it may be more efficient to use TTL or design your schema so that entire tables can be dropped and recreated rather than deleting massive amounts of data from within partitions.
3. Bulk Cleanup - DROP:
If you need to wipe an entire table or delete most of your data, and you don't need the table schema anymore, DROP TABLE is a cleaner approach that removes the entire table structure along with its data. Using DROP instead of iterative deletes provides massive performance benefits and cost savings by avoiding the overhead of processing individual delete operations and tombstone accumulation. Always choose these built-in commands for bulk cleanup operations rather than attempting to delete data programmatically row by row.
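The cost difference is easy to see side by side. This sketch only generates the CQL text for each approach (the keyspace and table names are hypothetical): a single DROP versus one DELETE statement, and one tombstone, per row.

```python
# Sketch: bulk cleanup via DROP vs. iterative row-by-row deletes.
# Keyspace/table names are hypothetical examples.

def drop_table(keyspace, table):
    """One statement removes the table and all of its data, no tombstones."""
    return f"DROP TABLE IF EXISTS {keyspace}.{table};"


def row_deletes(keyspace, table, ids):
    """One DELETE (and one tombstone) per row: the pattern to avoid for bulk cleanup."""
    return [f"DELETE FROM {keyspace}.{table} WHERE id = {i};" for i in ids]


print(drop_table("shop", "stale_orders"))                       # 1 statement
print(len(row_deletes("shop", "stale_orders", range(10000))))   # 10000 statements
```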
4. Avoid Range Deletes:
If you try to delete data by partition key and the operation affects more than 1,000 rows, you receive a range delete error. The problem with range deletes is that Keyspaces must perform a "read-before-write" operation internally to locate all rows matching the range condition, effectively doubling the work and consuming both read and write capacity. This often leads to timeouts, high latency, or WriteThrottleEvents. Range delete requests are limited by the amount of items that can be deleted in a single range[6].
To delete more than 1,000 rows within a single partition, consider the following options:
- Delete by partition – If the majority of partitions are under 1,000 rows, you can attempt to delete data by partition. If the partitions contain more than 1,000 rows, attempt to delete by the clustering column instead.
- Delete by clustering column – If your model contains multiple clustering columns, you can use the column hierarchy to delete multiple rows. Clustering columns are a nested structure, and you can delete many rows by operating against the top-level column.
- Delete by individual row – You can iterate through the rows and delete each row by its full primary key (partition columns and clustering columns).
- As a best practice, consider splitting your rows across partitions – In Amazon Keyspaces, we recommend that you distribute your throughput across table partitions. This distributes data and access evenly across physical resources, which provides the best throughput. For more information, refer to [7]
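The "delete by clustering column" and "delete by individual row" options above can be sketched as statement generators. The table and column names (iot.events with partition key device_id and clustering columns day, ts) are hypothetical, and executing the statements through a driver session is elided.

```python
# Sketch: two ways to clear a large partition without a >1,000-row range delete.
# Table and column names are hypothetical examples.

def delete_by_top_clustering(keyspace, table, device_id, days):
    """Option: delete all rows under each top-level clustering value (day),
    so each statement removes one sub-slice of the partition."""
    return [
        f"DELETE FROM {keyspace}.{table} "
        f"WHERE device_id = '{device_id}' AND day = '{d}';"
        for d in days
    ]


def delete_by_full_key(keyspace, table, rows):
    """Option: iterate rows and delete each by its full primary key
    (partition column plus all clustering columns)."""
    return [
        f"DELETE FROM {keyspace}.{table} "
        f"WHERE device_id = '{device_id}' AND day = '{day}' AND ts = {ts};"
        for (device_id, day, ts) in rows
    ]


# Clear two daily slices of one device's partition, one sub-range at a time.
stmts = delete_by_top_clustering("iot", "events", "d-1", ["2024-01-01", "2024-01-02"])
for s in stmts:
    print(s)
```

Each generated statement stays well under the 1,000-row range delete limit as long as a single day's slice does; if a slice can still exceed it, fall back to `delete_by_full_key` for that slice.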
Consider also the following recommendations when you're planning delete operations for heavy workloads.
- With Amazon Keyspaces, partitions can contain a virtually unbounded number of rows. This allows you to scale partitions “wider” than the traditional Cassandra guidance of 100 MB. It’s not uncommon for time series tables or ledgers to grow beyond a gigabyte of data over time.
- With Amazon Keyspaces, there are no compaction strategies or tombstones to consider when you have to perform delete operations for heavy workloads. You can delete as much data as you want without impacting read performance.
Tombstone Accumulation:
When you delete data in Amazon Keyspaces, the system doesn't immediately remove the data from disk. Instead, it writes a tombstone, which is a marker indicating that the data has been deleted. Tombstones remain until compaction occurs and can accumulate if you perform many deletes. The service removes tombstoned data automatically (typically within 10 days). As you continue to perform reads and writes on rows that contain tombstoned data, the tombstoned data continues to count towards storage, read capacity units (RCUs), and write capacity units (WCUs) until it's deleted from storage. Excessive tombstones degrade read performance because the system must scan through them to find valid data. While Amazon Keyspaces handles compaction automatically, understanding tombstone trends helps you identify potential performance issues before they impact your application. High tombstone counts relative to live data suggest that your data model or delete patterns may need adjustment.[8]
References:
[1] Use batch statements in Amazon Keyspaces - https://docs.aws.amazon.com/keyspaces/latest/devguide/best-practices.html
[2] Write consistency levels - https://docs.aws.amazon.com/keyspaces/latest/devguide/consistency.html#WriteConsistency
[3] Use write sharding to evenly distribute workloads across partitions - https://docs.aws.amazon.com/keyspaces/latest/devguide/bp-partition-key-sharding.html
[4] Time-To-Live (TTL) in Amazon Keyspaces - https://docs.aws.amazon.com/keyspaces/latest/devguide/TTL.html
[5] Amazon Keyspaces capacity modes - https://docs.aws.amazon.com/keyspaces/latest/devguide/ReadWriteCapacityMode.html
[6] Range deletes - https://docs.aws.amazon.com/keyspaces/latest/devguide/troubleshooting.cql.html#troubleshooting.cql.rangedelete
[7] Data modeling best practices: recommendations for designing data models - https://docs.aws.amazon.com/keyspaces/latest/devguide/data-modeling.html
[8] Client-side timestamps in Amazon Keyspaces - https://docs.aws.amazon.com/keyspaces/latest/devguide/client-side-timestamps.html