When I review the Amazon CloudWatch metrics for my Amazon DynamoDB workloads, the maximum latency metric is high, but the average latency is normal.
When you analyze the CloudWatch metric SuccessfulRequestLatency, it's a best practice to check the average latency. Maximum latency doesn't give a picture of overall latency on your DynamoDB table. Instead, it shows the maximum time taken by a single request in that period. For example, if you have 100 requests on a DynamoDB table at one time, even if 99 requests take 10 ms and a single request takes 100 ms, then the maximum latency metric is 100 ms.
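The arithmetic behind this example can be checked with a short sketch. The latency values are the illustrative numbers from the paragraph above, not real CloudWatch data, and the variable names are arbitrary.

```python
from statistics import mean

# 99 requests that take 10 ms each, plus one slow request that takes 100 ms.
latencies_ms = [10] * 99 + [100]

average_ms = mean(latencies_ms)  # (99 * 10 + 100) / 100
maximum_ms = max(latencies_ms)   # the single slowest request

print(f"Average latency: {average_ms} ms")
print(f"Maximum latency: {maximum_ms} ms")
```

The average works out to 10.9 ms, which is close to the typical request, while the maximum of 100 ms reflects only the single outlier.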
DynamoDB is a massively scaled distributed system, with thousands of nodes in the backend fleet. So, a DynamoDB table might have multiple partitions in the tablespace, and each partition has multiple copies in the backend fleet. When you make an API call to DynamoDB, the DynamoDB service endpoint receives your call, and then routes it to one of the backend nodes for processing. When the call is successfully processed, DynamoDB routes the results back to your client.
In most cases, the API call is successfully processed in a single attempt, and you observe low latency on the client side. But, sometimes the first attempt fails if the backend node is experiencing:
- A busy period
- Partition split
- Connectivity issues
In cases like these, the first attempt fails within a timeout on the server side (5000 ms). Then, the server automatically retries the API call on another node, often multiple times. The server returns the result to your client when the API call is successfully processed. When this happens, you observe elevated latency for that particular request.
So, a high maximum latency metric is generally not a cause for concern. If the DynamoDB service observes consistently high latency from one node, then the service automatically removes that node from the backend fleet. You might observe elevated latency for a certain percentage of API calls when the previously mentioned localized failures occur on the service side. This is reflected in a high maximum SuccessfulRequestLatency in the CloudWatch metrics for the related DynamoDB tables. For this reason, localized failures can increase your maximum latency, but you don't need to take any action to address these failures.
However, you can configure your application to react quickly by failing fast and retrying with exponential backoff. This means that the retried request can reach a different node, and you get faster results. For more information, see Tuning AWS Java SDK HTTP request settings for latency-aware Amazon DynamoDB applications.
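As a rough illustration of failing fast with exponential backoff, the following minimal sketch retries a failing call with a randomized, doubling delay. It is not the AWS SDK's implementation. The helper `call_with_backoff`, its default values, and the stand-in `flaky_read` function are all hypothetical names chosen for this example. The AWS SDKs already include retry logic with backoff, so in practice you would usually tune the SDK's timeout and retry settings rather than write your own loop.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay=0.05,
                      max_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure, wait and retry with exponential backoff.

    All names and default values here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # A real client should catch only retryable errors (for example,
            # timeouts), not every exception as this sketch does.
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: sleep for a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))


# Hypothetical stand-in for a DynamoDB call made with a short client timeout.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated slow node")
    return {"Item": {"id": "123"}}


result = call_with_backoff(flaky_read)
print(result, "after", attempts["count"], "attempts")
```

In this sketch, the first two simulated attempts time out quickly and the third succeeds, so the caller waits only for the short backoff delays instead of waiting on a slow node.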
How can I troubleshoot high latency on an Amazon DynamoDB table?
Latency metrics logging