Identifying and Resolving Hot Partition Issues in Amazon Keyspaces
Amazon Keyspaces (for Apache Cassandra) distributes data across multiple backend partitions to provide scalability and high performance. However, when certain partitions receive disproportionately high traffic, they can become "hot partitions," leading to throttling and timeout exceptions. This article provides a comprehensive guide to identifying, understanding, and resolving hot partition issues in Amazon Keyspaces.
Understanding Partition-Level Throttling
What is a Hot Partition?
Amazon Keyspaces distributes data across multiple partitions at the backend, with each partition having strict throughput limits:
- Write Capacity: 1,000 WCU (Write Capacity Units) per second per partition
- Read Capacity: 3,000 RCU (Read Capacity Units) per second per partition
These are hard limits that cannot be increased or otherwise influenced by customers. When a single partition receives traffic exceeding these limits, the result is throttling, which surfaces through the StoragePartitionThroughputCapacityExceeded metric and, at the client side, as timeout exceptions.
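As a rough illustration of what these limits mean for a single partition, the arithmetic sketch below assumes the published unit sizes (one WCU covers a write of up to 1 KB per second, one RCU a LOCAL_QUORUM read of up to 4 KB per second) and an example 3 KB row; the row size is illustrative.

public class PartitionLimitMath {
    public static void main(String[] args) {
        double rowKb = 3.0;                                    // example row size in KB
        double wcuPerRow = Math.ceil(rowKb / 1.0);             // writes are billed per 1 KB
        double rcuPerRow = Math.ceil(rowKb / 4.0);             // LOCAL_QUORUM reads per 4 KB
        System.out.printf("Max writes/s to one partition: %.0f%n", 1000 / wcuPerRow);  // ~333
        System.out.printf("Max reads/s from one partition: %.0f%n", 3000 / rcuPerRow); // 3000
    }
}

Larger rows therefore lower the effective per-partition row throughput well below the headline 1,000 writes per second.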
Causes of Throttling:
Throttling in Keyspaces tables occurs due to two primary scenarios:
- For tables with Provisioned Capacity Mode:
- Exceeding the provisioned capacity allocated for the table
- Exceeding the partition-level throughput limit
- For tables with On-Demand Capacity Mode:
- Consuming more than double the previous traffic peak for the table within 30 minutes
- Exceeding the partition-level throughput limit
One of the most challenging aspects of diagnosing hot partition issues is that CloudWatch metrics can be misleading: CloudWatch publishes Keyspaces metrics as one-minute aggregates, while throughput is consumed, and partition limits are enforced, on a per-second basis. A table whose average consumption looks modest over a minute can therefore still be throttled by a single second of concentrated traffic against one partition.
How to Investigate
To properly identify hot partition issues:
- Check CloudWatch metrics for StoragePartitionThroughputCapacityExceeded and WriteThrottleEvents or ReadThrottleEvents (a sketch of querying these metrics follows this list)
- Analyze application logs for timeout exceptions occurring during write or read operations
- Look for spiky traffic patterns where certain seconds show dramatically higher activity than others
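The following sketch uses the AWS SDK for Java v2 to pull one-minute sums of the metrics named above. The namespace, the Keyspace/TableName dimension names, and the my_keyspace/my_table values are assumptions based on how Keyspaces publishes metrics; adjust them, and the credentials/region setup, for your environment.

import java.time.Duration;
import java.time.Instant;

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Datapoint;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class ThrottleMetricCheck {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            for (String metric : new String[] {
                    "WriteThrottleEvents",
                    "ReadThrottleEvents",
                    "StoragePartitionThroughputCapacityExceeded"}) {
                GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
                        .namespace("AWS/Cassandra")              // Keyspaces metric namespace
                        .metricName(metric)
                        .dimensions(
                                Dimension.builder().name("Keyspace").value("my_keyspace").build(),
                                Dimension.builder().name("TableName").value("my_table").build())
                        .startTime(Instant.now().minus(Duration.ofHours(1)))
                        .endTime(Instant.now())
                        .period(60)                              // one-minute datapoints
                        .statistics(Statistic.SUM)
                        .build();

                for (Datapoint dp : cw.getMetricStatistics(request).datapoints()) {
                    System.out.printf("%s %s sum=%.0f%n", metric, dp.timestamp(), dp.sum());
                }
            }
        }
    }
}

Because these are one-minute sums, combine them with application logs to pinpoint the exact seconds in which throttling occurred.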
Understanding On-Demand Capacity Mode Behavior
Amazon Keyspaces tables using on-demand capacity mode automatically adapt to your application's traffic volume, but with specific constraints:
The Double Peak Rule
On-demand capacity mode instantly accommodates up to double the previous peak traffic on a table within 30 minutes.
Example Traffic Pattern:
- Your application's traffic varies between 5,000 and 10,000 LOCAL_QUORUM writes per second
- Previous peak: 10,000 writes per second
- On-demand instantly accommodates: Up to 20,000 writes per second
- If sustained at 20,000 writes per second, the new peak becomes 20,000
- Next accommodation level: Up to 40,000 writes per second
Exceeding the Double Peak Limit
If you need more than double your previous peak, Amazon Keyspaces allocates additional capacity automatically as your traffic volume grows, but not instantly. You may therefore observe insufficient throughput capacity errors if you exceed double your previous peak within 30 minutes, until the table adapts to the new level.
Solutions and Best Practices
- Randomize Data Distribution
If your data is sorted by partition key value, consider randomizing it before importing into Keyspaces. This spreads writes across many partitions and lets you achieve higher aggregate throughput.
Implementation: Add a random suffix or prefix to your partition keys to ensure better distribution across partitions, as in the sketch below.
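A minimal sketch of the random-suffix approach, assuming a hypothetical orders table whose partition key is (order_date, bucket); the table, columns, and bucket count are illustrative, and the Keyspaces-specific connection settings (endpoint, SSL, SigV4 auth) are omitted.

import java.time.LocalDate;
import java.util.concurrent.ThreadLocalRandom;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class RandomizedWriter {
    private static final int BUCKETS = 10;   // number of random buckets per natural key

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: PRIMARY KEY ((order_date, bucket), order_id)
            PreparedStatement insert = session.prepare(
                    "INSERT INTO my_keyspace.orders (order_date, bucket, order_id, payload) "
                  + "VALUES (?, ?, ?, ?)");

            LocalDate today = LocalDate.now();
            for (int i = 0; i < 1000; i++) {
                int bucket = ThreadLocalRandom.current().nextInt(BUCKETS); // random suffix
                session.execute(insert.bind(today, bucket, (long) i, "row-" + i));
            }
        }
    }
}

Reads then need to query every bucket value for a given order_date and merge the results, so choose the bucket count to balance write spread against read fan-out.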
- Temporarily Increase Provisioned Capacity
For provisioned capacity tables, temporarily raising the provisioned capacity can help (a CQL sketch follows this list):
- Increases the number of backend table partitions by splitting existing partitions into child partitions
- Distributes data across child partitions
- Shrinks the hash range value for each partition
- Allows writes to get distributed across more partitions
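One way to apply such a temporary increase is through CQL, as in the sketch below; the custom_properties syntax should be verified against the Amazon Keyspaces documentation, and the keyspace, table, and capacity numbers are illustrative.

import com.datastax.oss.driver.api.core.CqlSession;

public class RaiseProvisionedCapacity {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Temporarily raise provisioned throughput; Keyspaces splits busy partitions
            // in the background as the table's capacity grows. Values are illustrative.
            session.execute(
                "ALTER TABLE my_keyspace.my_table "
              + "WITH custom_properties = {"
              + "  'capacity_mode': {"
              + "    'throughput_mode': 'PROVISIONED',"
              + "    'read_capacity_units': 6000,"
              + "    'write_capacity_units': 6000"
              + "  }"
              + "}");
        }
    }
}

Scale the capacity back down once the spike or backfill has passed to avoid paying for unused provisioned throughput.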
- Shape Your Traffic
Gradually increase requests per second rather than creating sudden traffic spikes:
- For On-Demand mode: Avoid exceeding double your previous peak traffic within 30 minutes
- Implement a gradual ramp-up of traffic to allow the system to adapt (see the sketch after this list)
- Monitor traffic patterns and plan capacity increases accordingly
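A minimal ramp-up sketch using Guava's RateLimiter to raise the request rate in steps instead of jumping straight to the target; the step factor, interval, target rate, and the issueWrite() call are illustrative.

import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

public class TrafficRamp {
    private static long lastStep = System.nanoTime();

    public static void main(String[] args) {
        double targetRatePerSecond = 10_000;          // desired steady-state writes/s
        double currentRate = 1_000;                   // starting rate
        RateLimiter limiter = RateLimiter.create(currentRate);

        while (true) {
            limiter.acquire();                        // blocks to honor the current rate
            // issueWrite();                          // hypothetical application write

            // Every few minutes, raise the rate by a modest step instead of doubling,
            // so an on-demand table never sees more than 2x its previous peak at once.
            if (shouldStepUp() && currentRate < targetRatePerSecond) {
                currentRate = Math.min(currentRate * 1.5, targetRatePerSecond);
                limiter.setRate(currentRate);
            }
        }
    }

    private static boolean shouldStepUp() {
        long now = System.nanoTime();
        if (now - lastStep > TimeUnit.MINUTES.toNanos(5)) {
            lastStep = now;
            return true;
        }
        return false;
    }
}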
- Implement Write Sharding
Design your application to drive uniform activity across all logical partition keys in the table. The partition key portion of a table's primary key determines the logical partition in which a row is stored.
Key Principle: Distribute I/O requests evenly to avoid creating "hot" partitions that cause throttling and use your provisioned I/O capacity inefficiently. A sharding sketch follows below.
Reference: Write Sharding Best Practices
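A minimal write-sharding sketch, assuming a hypothetical events table keyed by ((device_id, shard), event_time): the shard is derived deterministically from a hash so writes for one device spread across a fixed number of logical partitions, and reads fan out over the same shards. Table, columns, and shard count are illustrative.

import java.time.Instant;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class ShardedEventStore {
    private static final int SHARDS = 8;   // fixed shard count; changing it later means rewriting data

    // Deterministic shard so a given event always lands in the same logical partition.
    static int shardFor(String eventId) {
        return Math.floorMod(eventId.hashCode(), SHARDS);
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: PRIMARY KEY ((device_id, shard), event_time)
            PreparedStatement insert = session.prepare(
                "INSERT INTO my_keyspace.events (device_id, shard, event_time, event_id) "
              + "VALUES (?, ?, ?, ?)");
            PreparedStatement select = session.prepare(
                "SELECT event_id FROM my_keyspace.events WHERE device_id = ? AND shard = ?");

            String deviceId = "sensor-42";
            String eventId = "evt-123";
            session.execute(insert.bind(deviceId, shardFor(eventId), Instant.now(), eventId));

            // Reads fan out across all shards for the device and merge the results.
            for (int shard = 0; shard < SHARDS; shard++) {
                for (Row row : session.execute(select.bind(deviceId, shard))) {
                    System.out.println(row.getString("event_id"));
                }
            }
        }
    }
}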
- Optimize Spark Connector Configurations
For applications using Apache Spark with Keyspaces, apply these Spark Cassandra Connector settings (a sketch of setting them on a Spark session follows the benefits list below):
spark.cassandra.output.batch.size.rows = 1
spark.cassandra.output.batch.grouping.key = none
spark.cassandra.output.batch.grouping.buffer.size = 100
Benefits:
- Turning off batching improves random access patterns
- Better distribution of writes across partitions
- Reduced likelihood of hitting partition-level limits
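A sketch of applying these settings when building a Spark session in Java; the property names come from the list above, while the app name, source path, keyspace, and table are illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KeyspacesSparkWriter {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("keyspaces-writer")
                // Settings recommended above: disable batching for better write distribution.
                .config("spark.cassandra.output.batch.size.rows", "1")
                .config("spark.cassandra.output.batch.grouping.key", "none")
                .config("spark.cassandra.output.batch.grouping.buffer.size", "100")
                .getOrCreate();

        Dataset<Row> df = spark.read().parquet("s3://my-bucket/staged-data/"); // illustrative source

        df.write()
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "my_keyspace")
          .option("table", "my_table")
          .mode("append")
          .save();
    }
}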
Understanding Retry Options
Throttling exceptions in Amazon Keyspaces are translated to timeout exceptions at the client side, so implementing proper retry logic is essential for handling transient throttling.
Application-Level Retries: For Java applications, you can enable automatic retries by setting idempotence to true on your statements.
Recommended Retry Configuration (a sketch follows below):
- Number of retries: 10
- Backoff strategy: Exponential backoff
- Starting wait time: 10ms
- Maximum wait time: 100ms
- Total retry time: Approximately 1 second
Reference: Error Retries and Exponential Backoff
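A minimal sketch of an application-level retry loop following the parameters above (up to 10 attempts, exponential backoff starting at 10 ms and capped at 100 ms), marking the statement idempotent so the driver may also safely retry it; the exception handling shown is a broad illustration, not the only reasonable choice.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DriverException;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class RetryingWriter {
    private static final int MAX_RETRIES = 10;
    private static final long BASE_WAIT_MS = 10;
    private static final long MAX_WAIT_MS = 100;

    static void writeWithRetry(CqlSession session, SimpleStatement statement) throws InterruptedException {
        // Mark the statement idempotent so driver-level retries are also allowed.
        SimpleStatement idempotent = statement.setIdempotent(true);

        for (int attempt = 0; ; attempt++) {
            try {
                session.execute(idempotent);
                return;
            } catch (DriverException e) {           // throttling surfaces as timeouts/driver errors
                if (attempt >= MAX_RETRIES) {
                    throw e;
                }
                long wait = Math.min(BASE_WAIT_MS << attempt, MAX_WAIT_MS); // exponential backoff, capped
                Thread.sleep(wait);
            }
        }
    }
}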
Driver-Level Retries: The default AmazonKeyspacesExponentialRetryPolicy is conservative with:
- Max attempts: 3
- Minimum wait: 10ms
- Maximum wait: 100ms
Enhanced Retry Policy Configuration: For better handling of hot partition scenarios, consider using the AmazonKeyspacesExponentialRetryPolicy with enhanced settings:
retry-policy {
  class = com.aws.ssa.keyspaces.retry.AmazonKeyspacesExponentialRetryPolicy
  max-attempts = 3
  min-wait = 100 ms
  max-wait = 2000 ms
}
Rationale: Since hot partition throttling often lasts for only a second or two, longer backoff times (up to 2000ms) help spread writes across multiple seconds, avoiding repeated hits to the same partition during its throttled period.
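For the DataStax Java driver 4.x, this retry-policy block is expected to live under datastax-java-driver.advanced in your configuration file; the sketch below shows one way to load it explicitly when building the session, assuming the com.aws.ssa.keyspaces.retry helper classes are on the classpath.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class SessionWithRetryPolicy {
    public static void main(String[] args) {
        // application.conf on the classpath should contain the retry-policy block shown
        // above, nested under datastax-java-driver.advanced (driver 4.x layout). Loading
        // it explicitly is shown for clarity; the default loader also picks it up.
        try (CqlSession session = CqlSession.builder()
                .withConfigLoader(DriverConfigLoader.fromClasspath("application"))
                .build()) {
            // ... all requests on this session now go through the configured retry policy
        }
    }
}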
Adaptive Retry Strategy: When throttling errors are detected:
- Back off for 1-2 seconds before retrying
- This allows the heavy influx on the partition to subside
- Spreads writes across multiple seconds
- Helps avoid exceeding the partition limit on retry
References
- Error Retries and Exponential Backoff: https://docs.aws.amazon.com/general/latest/gr/api-retries.html
- Write Sharding Best Practices: https://docs.aws.amazon.com/keyspaces/latest/devguide/bp-partition-key-design.html
- Troubleshooting Serverless Issues: https://docs.aws.amazon.com/keyspaces/latest/devguide/troubleshooting.serverless.html
- Amazon Keyspaces Retry Policy Documentation: https://docs.aws.amazon.com/keyspaces/latest/devguide/connections.html#connections.retry-policies