High P99 Latency on DynamoDB via VPC Endpoint during Low TPS (Java CRT Client)

We are seeing a counterintuitive pattern: our DynamoDB latency is significantly worse during low-traffic periods (5-20 TPS) than during high-traffic periods (60-100 TPS).

  • High TPS: 15-20 ms (healthy)
  • Low TPS: 40-45 ms (breaching our 50 ms API SLA)

AWS Support suggests this is a "cold cache" or connection reuse issue. Despite tuning the AWS CRT Java SDK client, we cannot eliminate these latency spikes at low volumes.

Architecture

  • Compute: AWS ECS (Fargate/EC2)
  • Database: DynamoDB via VPC Gateway Endpoint
  • Client: AWS SDK for Java v2 (using the CRT HTTP Client)
  • Access Pattern: Fetching 6 Items per API request (either via 2 parallel Query calls or 1 BatchGetItem).

Current Configuration & Attempts

We have experimented with the following CRT settings with no significant improvement in the low-TPS spikes:

  • Max Concurrency: Tested 10, 50, 100, 200.
  • Connection Settings: Adjusted connectionMaxIdleTime, connectionKeepAliveConfiguration, and TCP_KEEPALIVE.
  • Pattern: Both parallel queries and BatchGetItem exhibit the same behavior.

We would appreciate help with the following questions:

  1. Is there a "sweet spot" for connectionMaxIdleTime or maxConcurrency specifically for the CRT client to keep connections "warm" during low TPS?
  2. Does the VPC Gateway Endpoint introduce specific connection-drain behaviors at low volumes that we should account for?
  3. Would switching from the CRT client to the Netty NIO client (with specific connectionAcquisitionTimeout tuning) provide more stability for this low-latency SLA?
  4. Are there specific JVM-level or SDK-level "heartbeat" strategies to prevent the connection pool from going cold when traffic dips?
1 Answer

Based on the behavior you're describing, here are some insights that may help address your low-TPS latency issues:

Understanding the Latency Pattern: Some variation in latency metrics, particularly in higher percentiles like P99, is expected in DynamoDB and can result from background operations that maintain high availability and durability, or transient infrastructure issues. However, the consistent pattern you're seeing during low traffic suggests connection management issues rather than DynamoDB service-side problems.

Connection Reuse and Configuration: The AWS SDK for Java 2.x best practices recommend reusing service clients, because reuse avoids the time spent establishing new connections. At low TPS, pooled connections sit idle for longer and are more likely to be closed before the next request arrives; a request that lands on a new connection then pays for TCP and TLS setup before any DynamoDB work starts. That is the "cold start" latency you're experiencing. For your specific questions:

  1. Connection Settings: There isn't a universally prescribed "sweet spot" for connectionMaxIdleTime or maxConcurrency specific to CRT; the goal is to balance connection retention against resource use. Consider setting connectionMaxIdleTime high enough to outlast the gaps between requests on a connection during low-traffic periods (potentially several minutes), so that connections are not closed prematurely. Keep in mind that every idle connection still holds a socket and memory on the client, so a very large idle pool has a cost.

  2. VPC Gateway Endpoint Behavior: VPC Gateway Endpoints themselves don't introduce specific connection-drain behaviors that would cause this pattern. The endpoint acts as a routing mechanism and shouldn't be the primary cause of your latency variance between high and low TPS periods.

  3. CRT vs. Netty Client: Both clients can maintain connection pools effectively. The choice between CRT and Netty is usually driven by performance characteristics at scale, not by cold-connection behavior. The underlying problem, connection pool management during low traffic, exists regardless of the HTTP client implementation, so switching clients alone is unlikely to remove the spikes.

  4. Keeping Connections Warm: Consider implementing application-level strategies such as periodic health checks or background requests during low-traffic periods to keep the connection pool active. This "synthetic traffic" approach can prevent connections from going cold without requiring SDK-level modifications.
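
As a rough illustration of the keep-warm idea in point 4, here is a minimal sketch that uses only the JDK. ConnectionWarmer, its ping supplier, and the parallelism parameter are hypothetical names and are not part of the SDK. The supplier is assumed to wrap a cheap read (for example, a GetItem on a known key) issued through the same shared async client that your request path uses. Firing several pings at once is an assumption about how to touch more than one pooled connection; how pings map to connections depends on the HTTP client's pool.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical helper: periodically fires cheap async requests so pooled connections stay in use. */
final class ConnectionWarmer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "connection-warmer");
                t.setDaemon(true); // never keep the JVM alive just for warm-up traffic
                return t;
            });
    private final Supplier<CompletableFuture<?>> ping;
    private final int parallelism;
    private final AtomicLong failures = new AtomicLong();

    ConnectionWarmer(Supplier<CompletableFuture<?>> ping, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        this.ping = ping;
        this.parallelism = parallelism;
    }

    /** Starts pinging; the interval is assumed to be shorter than connectionMaxIdleTime. */
    void start(Duration interval) {
        scheduler.scheduleWithFixedDelay(this::pingOnce, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Issues `parallelism` pings concurrently and waits for all of them to finish. */
    void pingOnce() {
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            try {
                // Count failures but never propagate them: an exception escaping
                // this method would cancel all future scheduled runs.
                CompletableFuture<Void> done = ping.get().handle((result, error) -> {
                    if (error != null) {
                        failures.incrementAndGet();
                    }
                    return null;
                });
                inFlight.add(done);
            } catch (RuntimeException e) {
                failures.incrementAndGet();
            }
        }
        // Blocks the scheduler thread; the ping itself should carry a timeout.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }

    long failureCount() {
        return failures.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong sent = new AtomicLong();
        // Stub ping for demonstration; a real one would call the shared DynamoDB async client.
        try (ConnectionWarmer warmer = new ConnectionWarmer(() -> {
            sent.incrementAndGet();
            return CompletableFuture.completedFuture(null);
        }, 2)) {
            warmer.start(Duration.ofMillis(100));
            Thread.sleep(350);
        }
        System.out.println("pings sent: " + sent.get());
    }
}
```

If you try something like this, keep the ping interval below connectionMaxIdleTime and remember that each ping is a real read that consumes capacity, so it is worth pausing the warmer when regular traffic is already keeping the pool busy.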

Additional Recommendations:

  • Configure appropriate API timeouts (apiCallAttemptTimeout and apiCallTimeout) to ensure your application fails fast during network issues rather than waiting and accumulating latency.
  • Monitor and log request IDs for slow requests, as this information is valuable if you need to escalate with AWS Support.
  • Remember that the SuccessfulRequestLatency metric measures only latency inside the DynamoDB service; client-side activity and network round-trip times are not included. Enable SDK metric logging to see where time is spent along your request path.
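
To make the settings discussed above concrete, here is a configuration sketch that combines the idle-time, timeout, and metric-logging suggestions on a single shared client. It assumes the async CRT HTTP client and the AWS SDK for Java 2.x DynamoDB, CRT client, and core modules on the classpath. DynamoClientFactory is a hypothetical name, and every numeric value is an illustrative placeholder to tune against your own measurements, not a recommendation.

```java
import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.http.crt.AwsCrtAsyncHttpClient;
import software.amazon.awssdk.metrics.LoggingMetricPublisher;
import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient;

/** Hypothetical factory; build the client once at startup and share it across requests. */
final class DynamoClientFactory {

    static DynamoDbAsyncClient create() {
        return DynamoDbAsyncClient.builder()
                .httpClientBuilder(AwsCrtAsyncHttpClient.builder()
                        // Placeholder pool size.
                        .maxConcurrency(50)
                        // Placeholder: keep idle connections for several minutes.
                        .connectionMaxIdleTime(Duration.ofMinutes(5)))
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        // Placeholder: upper bound for a single HTTP attempt.
                        .apiCallAttemptTimeout(Duration.ofMillis(200))
                        // Placeholder: upper bound for the whole call, retries included.
                        .apiCallTimeout(Duration.ofMillis(500))
                        // Logs the SDK's client-side metrics for each call.
                        .addMetricPublisher(LoggingMetricPublisher.create())
                        .build())
                .build();
    }

    private DynamoClientFactory() {
    }
}
```

Region and credentials are left to the SDK's default provider chains here. Comparing the logged client-side timings with SuccessfulRequestLatency should show whether the extra time at low TPS is spent acquiring or establishing connections.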

For streaming responses, and for SDK operations in general, make sure you close resources properly. Unclosed resources can exhaust the connection pool, which makes latency worse when traffic fluctuates.
Sources
Troubleshooting latency issues in Amazon DynamoDB - Amazon DynamoDB
Best practices for using the AWS SDK for Java 2.x - AWS SDK for Java 2.x

answered 2 months ago
AWS
EXPERT
reviewed 2 months ago
