I want to know why my Amazon DynamoDB streams take so long to process.
Resolution
When you use AWS Lambda to consume DynamoDB streams, records that are written to a table might not be processed by Lambda, and in some cases can take several hours to process. To troubleshoot the cause, review the Lambda IteratorAge metric.
When the IteratorAge increases, Lambda isn't efficiently processing records that are written to the DynamoDB stream. The IteratorAge metric can increase for the following reasons.
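To review the metric programmatically rather than in the console, the following Python sketch uses boto3 and CloudWatch's get_metric_statistics call. The function name stream-processor is a placeholder for your own function name.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Pull the maximum IteratorAge (milliseconds) over the last 3 hours,
# aggregated in 5-minute periods, for a placeholder function name.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "stream-processor"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), point["Maximum"], "ms")
```

A sustained upward trend in this output indicates that the stream consumer is falling behind for one of the reasons described below.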
Invocation errors
Lambda is designed to process batches of records in sequence and to retry on errors. If a function returns an error every time that it's invoked, then Lambda continues to retry the batch until the records expire or exceed the maximum record age that's configured on the event source mapping. Because the records in the stream aren't processed in the meantime, the IteratorAge metric spikes. Check the Lambda Errors metric for corresponding spikes, or review your Lambda logs for more information.
To prevent invocation errors from causing a spike in the IteratorAge metric, send failed records to a dead-letter queue (DLQ). When you create the event source mapping, configure the following parameters:
- Retry attempts: The number of times that the Lambda function retries a batch when the function returns an error. Update this value based on your use case so that records are sent to the DLQ after the function retries the specified number of times.
- Split batch on error: Splits a batch into two batches before the function retries, and then continues the invocation with the smaller batches. Splitting the batch helps isolate the bad records that cause the invocation error. A minimal configuration sketch follows this list.
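The following Python sketch shows one way to set these options on an existing event source mapping with boto3. The mapping UUID and the SQS queue ARN used as the DLQ are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Update an existing DynamoDB stream event source mapping (UUID is a placeholder).
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    MaximumRetryAttempts=2,            # Retry attempts before the batch is discarded
    BisectBatchOnFunctionError=True,   # Split batch on error
    DestinationConfig={
        "OnFailure": {
            # DLQ for discarded records (placeholder SQS queue ARN)
            "Destination": "arn:aws:sqs:us-east-1:123456789012:stream-dlq"
        }
    },
)
```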
Throttling errors
When you use DynamoDB streams, it's a best practice to have no more than two consumers per shard to avoid throttling. Because Lambda processes records sequentially, Lambda can't continue to process records if a current invocation is throttled. If you must use more than two consumers, use Amazon Kinesis or a fan-out pattern. Or, use a concurrency limit in Lambda to prevent throttling.
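One way to apply a concurrency limit is to set reserved concurrency on the consumer function, as in the following boto3 sketch. The function name and the value of 100 are placeholders; choose a limit that fits your account's concurrency and your workload.

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve concurrency for the stream-processing function so that other functions
# in the account can't consume all of the unreserved concurrency and cause the
# stream consumer to be throttled. Function name is a placeholder.
lambda_client.put_function_concurrency(
    FunctionName="stream-processor",
    ReservedConcurrentExecutions=100,
)
```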
Lambda throughput
Lambda throughput is the rate at which your function processes records from the stream, and the Duration metric is a useful indicator of it. If the Duration metric value is high, then the function's throughput is low, which increases the IteratorAge metric for the function. To decrease the runtime duration of the function, increase the memory allocated to the function.
You can also optimize your function code so that less time is needed to process records.
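For example, the following boto3 sketch raises the function's memory allocation, which also increases the CPU available to it. The function name and the 1,024 MB value are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# Increase the memory allocated to the function. CPU scales with memory,
# so a higher MemorySize usually reduces Duration. Function name is a placeholder.
lambda_client.update_function_configuration(
    FunctionName="stream-processor",
    MemorySize=1024,  # MB
)
```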
Concurrent Lambda runs
Concurrency is the number of in-flight requests that your Lambda function is handling at the same time. An increase in the number of concurrent Lambda invocations can improve the processing of your stream records. For each concurrent request, Lambda provisions a separate instance of your execution environment. The maximum number of concurrent Lambda runs is calculated as follows:
Concurrent runs = Number of shards x Concurrent batches per shard (parallelization factor)
In a DynamoDB stream, there is a one-to-one mapping between the number of partitions in the table and the number of stream shards. The number of partitions is determined by the table's size and throughput. Each partition on the table can serve up to 3,000 read request units or 1,000 write request units, or a linear combination of both. To increase concurrency, increase the table's provisioned capacity, which increases the number of shards. You can also configure the number of concurrent batches per shard in the event source mapping. The default is 1, and you can increase it up to 10. For example, if the table has 10 partitions and Concurrent batches per shard is set to 5, then you can have up to 50 concurrent runs.
To process item-level modifications in the correct order at any given time, items with the same partition key go to the same batch. For this approach to increase concurrency, your table's partition key must have high cardinality, and your traffic must not generate hot keys. For example, if you set the Concurrent batches per shard value to 10, but your write traffic targets a single partition key, then you can have only one concurrent run per shard.
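The following boto3 sketch raises the concurrent batches per shard (the ParallelizationFactor parameter) on an existing event source mapping and works through the earlier 10-partition example. The mapping UUID is a placeholder.

```python
import boto3

lambda_client = boto3.client("lambda")

# Raise the concurrent batches per shard on an existing event source mapping
# (UUID is a placeholder).
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    ParallelizationFactor=5,  # 1 (default) to 10
)

# With 10 table partitions (shards) and 5 concurrent batches per shard:
shards = 10
parallelization_factor = 5
print("Maximum concurrent runs:", shards * parallelization_factor)  # 50
```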
Batch size and Batch window
When it's not correctly tuned, Lambda batching behavior can lower the function's throughput. To optimize the processing of your DynamoDB streams, adjust the batching window and batch size based on your use case.
If the batch size is set too low, then the function is frequently invoked with a small number of records. These frequent invocations slow down the processing of the stream. If the batch size is set too high, then your function duration might increase.
If the batch window value and the number of records in that window aren't optimized, then you might have unnecessary invocations of the same shard. These unnecessary invocations can slow down the processing of the stream. If the batch window value is too high, then you might wait longer for the function to process the stream. This longer processing time increases the IteratorAge.
To increase your throughput and decrease the IteratorAge, it's a best practice to test multiple combinations of batching window and batch size.
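For example, the following boto3 sketch adjusts both values on an existing event source mapping. The mapping UUID and the specific numbers are placeholders to start testing from.

```python
import boto3

lambda_client = boto3.client("lambda")

# Tune batching on an existing event source mapping (UUID is a placeholder).
# Test several combinations of batch size and batching window for your workload.
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    BatchSize=500,                      # number of records per invocation
    MaximumBatchingWindowInSeconds=5,   # how long to gather records before invoking
)
```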
Related information
Performance metrics
Understanding Lambda scaling and throughput
Understanding Lambda function scaling
Using the DynamoDB Streams Amazon Kinesis Adapter to process stream records
DynamoDB Streams and Lambda triggers