How do I troubleshoot ReadTimeout and WriteTimeout exceptions in Amazon Keyspaces?

I want to troubleshoot timeout exceptions in Amazon Keyspaces (for Apache Cassandra).

Resolution

Apache Cassandra is designed to run as a cluster on a fleet of nodes, but Amazon Keyspaces is serverless. Apache Cassandra doesn't have exceptions for serverless concepts, such as capacity, and most Apache Cassandra driver implementations handle only the errors that exist in Apache Cassandra. To maintain compatibility with those drivers, Amazon Keyspaces must use the same error codes. As a result, capacity-related issues such as throttling can surface as ReadTimeout and WriteTimeout exceptions.

For information on troubleshooting connection issues, see Troubleshooting connections in Amazon Keyspaces.

To troubleshoot timeouts in Amazon Keyspaces, use Amazon CloudWatch monitoring. To view table metrics, you can use the graphs in the Amazon Keyspaces console. Select the table, and then choose the Monitor tab. For information on using CloudWatch to monitor available Apache Cassandra metrics, see Amazon Keyspaces metrics and dimensions.

You can also use an AWS CloudFormation template (from the GitHub website) to deploy a CloudWatch dashboard to monitor your keyspaces or individual tables. You can monitor the following metrics:

  • PerConnectionRequestRateExceeded
  • StoragePartitionThroughputCapacityExceeded
  • ReadThrottleEvents and WriteThrottleEvents

PerConnectionRequestRateExceeded

The PerConnectionRequestRateExceeded metric measures Amazon Keyspaces requests that exceed the per-connection request rate quota. One connection to a peer can support up to 3,000 CQL requests per second. For more information, see Quotas for Amazon Keyspaces (for Apache Cassandra). When you connect to a public endpoint, nine peers are available to connect to. Therefore, with the default driver setting of one connection per peer, the default limit is 9 * 3,000 = 27,000 CQL requests per second.

When you use an Amazon Virtual Private Cloud (Amazon VPC) endpoint, the number of peers depends on the number of Availability Zones where the endpoint has an interface. For example, ap-southeast-2 includes three Availability Zones. In this Region, with one connection per peer configured, your application is limited to 3 * 3,000 = 9,000 CQL requests per second. To increase the number of CQL requests that are allowed per second, increase the number of connections that are made per peer. For example, on a public endpoint, two connections per peer provide 9 * 3,000 * 2 = 54,000 CQL requests per second.
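The arithmetic above can be sketched as a small helper. This is only an illustration: the class and method names (KeyspacesRequestCeiling, maxRequestsPerSecond, connectionsPerPeerFor) are hypothetical and not part of any driver or AWS SDK. The only facts taken from this article are the quota of 3,000 CQL requests per second per connection and the peer counts of nine (public endpoint) and three (VPC endpoint in ap-southeast-2).

```java
// Illustrative sketch only: estimates the client-side request ceiling
// from the per-connection quota described in this article.
final class KeyspacesRequestCeiling {
    // Quota from the article: up to 3,000 CQL requests per second per connection.
    static final int REQUESTS_PER_CONNECTION = 3_000;

    // Ceiling = peers * connections per peer * per-connection quota.
    static long maxRequestsPerSecond(int peers, int connectionsPerPeer) {
        if (peers < 1 || connectionsPerPeer < 1) {
            throw new IllegalArgumentException("peers and connectionsPerPeer must be at least 1");
        }
        return (long) peers * connectionsPerPeer * REQUESTS_PER_CONNECTION;
    }

    // Smallest number of connections per peer that reaches the target rate.
    static int connectionsPerPeerFor(long targetRequestsPerSecond, int peers) {
        if (peers < 1 || targetRequestsPerSecond < 1) {
            throw new IllegalArgumentException("peers and target must be at least 1");
        }
        long perConnectionAcrossPeers = (long) peers * REQUESTS_PER_CONNECTION;
        // Integer ceiling division.
        long needed = (targetRequestsPerSecond + perConnectionAcrossPeers - 1) / perConnectionAcrossPeers;
        return (int) Math.max(1, needed);
    }

    public static void main(String[] args) {
        System.out.println(maxRequestsPerSecond(9, 1)); // public endpoint, default pool: 27000
        System.out.println(maxRequestsPerSecond(3, 1)); // VPC endpoint in three AZs: 9000
        System.out.println(maxRequestsPerSecond(9, 2)); // public endpoint, two per peer: 54000
    }
}
```

A calculation like this gives only the connection-level ceiling. Table capacity and partition limits, covered in the sections that follow, can still throttle requests below this number.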

If you're using the DataStax Java driver v3, then make sure that your cluster includes the following pooling options:

Cluster cluster = Cluster.builder()
        .addContactPoint("cassandra.ap-southeast-2.amazonaws.com")
        .withPort(9142)
        ...
        .withPoolingOptions(new PoolingOptions().setConnectionsPerHost(HostDistance.LOCAL, 9, 9))
        .build();

If you're using the DataStax Java driver v4, then make sure that application.conf includes the following:

datastax-java-driver {  
    advanced.connection {    
        pool {      
            local {        
                size = 9      
            }    
        }  
    }
}

With the Python driver, you can't increase the number of connections per peer when you use the v3 or v4 protocol that Amazon Keyspaces requires.

If you're using the DataStax Java driver v4, then turn off hostname validation. If hostname validation is turned on, then the driver connects to only one peer, which severely limits the total number of requests per second.

The application.conf file must include the following:

datastax-java-driver {  
    advanced {    
        ssl-engine-factory {      
            class = DefaultSslEngineFactory      
            hostname-validation = false    
        }  
    }
}

When connecting through an Amazon VPC endpoint, add permissions to your AWS Identity and Access Management (IAM) policy to allow Amazon Keyspaces to query your VPC and endpoint information. This information is required to populate the system.peers table. The IAM policy must have the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListVPCEndpoints",
            "Effect": "Allow",
            "Action": ["ec2:DescribeNetworkInterfaces", "ec2:DescribeVpcEndpoints"],
            "Resource": "*"
        }
    ]
}

If the IAM policy doesn't have these permissions, then your driver can connect to only one host, which limits the total number of requests per second.

StoragePartitionThroughputCapacityExceeded

The StoragePartitionThroughputCapacityExceeded metric measures requests that exceed a partition's limit of 3,000 RCUs/RRUs and 1,000 WCUs/WRUs. When this metric is greater than zero, your current traffic is concentrated on one or a few partitions instead of being evenly distributed across multiple partitions. To resolve this issue, alter your traffic patterns. For more information, see Data modeling in Amazon Keyspaces (for Apache Cassandra).

ReadThrottleEvents and WriteThrottleEvents

The ReadThrottleEvents and WriteThrottleEvents metrics measure requests that exceed the available capacity for a table or an AWS account. Note that these metrics also include any throttles that are caused by PerConnectionRequestRateExceeded.

If you're using provisioned capacity, then increase the configured read or write provisioned capacity. Or, use AWS Auto Scaling. If you use AWS Auto Scaling, then increase the maximum available capacity. This might require increasing the account table or AWS Region limit to allow higher scaling. If you experience traffic spikes that Auto Scaling can't handle, then use on-demand capacity mode instead.

AllNodesFailedException or NoHostAvailable error

When an issue occurs, such as a read or write timeout, Apache Cassandra drivers try to connect to a different peer by default. For an Apache Cassandra cluster, this behavior can help mitigate an issue that's caused by a single node. However, for Amazon Keyspaces, the request to the new peer might encounter the same issue. The driver might then move on to the next peer until all peers are exhausted. In this case, you see either the AllNodesFailedException or NoHostAvailable error message.

It's a best practice to configure the driver to remain with the host and implement an exponential backoff and retry mechanism. For example configurations, see amazon-keyspaces-java-driver-helpers on the GitHub website.
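As a rough illustration of the backoff part of this advice, the following is a minimal, driver-independent sketch of exponential backoff with full jitter. It is not the implementation from amazon-keyspaces-java-driver-helpers. The class name BackoffRetry, its methods, and all parameters are hypothetical. The sketch assumes that the operation throws unchecked exceptions and that the caller supplies a predicate to decide which exceptions (for example, the driver's read and write timeout exceptions) are worth retrying.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative sketch only: retries an operation with exponential backoff and full jitter.
final class BackoffRetry {
    // Upper bound for the delay before the next attempt:
    // base, 2*base, 4*base, ... capped at capMillis.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        return Math.min(capMillis, baseMillis << shift);
    }

    static <T> T callWithRetry(Supplier<T> operation,
                               Predicate<RuntimeException> retryable, // assumption: caller decides
                               int maxAttempts,
                               long baseMillis,
                               long capMillis) {
        if (maxAttempts < 1 || baseMillis < 1 || capMillis < 1) {
            throw new IllegalArgumentException("maxAttempts, baseMillis, capMillis must be at least 1");
        }
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                boolean lastAttempt = attempt == maxAttempts - 1;
                if (lastAttempt || !retryable.test(e)) {
                    throw e; // give up: out of attempts or not a retryable error
                }
                long ceiling = backoffMillis(attempt, baseMillis, capMillis);
                // Full jitter: sleep a random time between 0 and the ceiling (inclusive).
                sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted during backoff", ie);
        }
    }
}
```

In an application, the operation would wrap a single statement execution, such as a call to the session's execute method, and the predicate would match only the timeout exceptions that you want to retry. The jitter spreads retries from many clients over time so that they don't all hit the same throttled table or partition at once.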

AWS OFFICIAL
Updated 10 months ago