Managing State in Lambda Functions Processing Kinesis Data Streams


Hello everybody,

I am developing an application that processes messages from an Amazon Kinesis Data Stream using AWS Lambda functions. My current approach assumes that the same instance of a Lambda function will handle all messages from a single shard, allowing me to store intermediate data in the function's execution environment (e.g., in global variables) for the lifetime of that instance.

I am reaching out to confirm whether this assumption is accurate and if this approach to state management within a Lambda function is advisable.

Specifically, I am concerned about the following:

  1. Instance Reuse Across Shard Messages: Can I reliably expect that the same Lambda function instance will process all messages from a particular shard consecutively, allowing for effective use of in-memory storage for state management? Is there a possibility that during the processing of messages from the same shard, one instance may be replaced by another, potentially disrupting the continuity of state? If so, how might this impact the effectiveness of using in-memory storage for state within a single processing sequence?

  2. Implications of Scaling: How does the scaling behavior of Lambda functions, in response to fluctuations in the volume of incoming messages, affect the continuity of processing messages from the same shard? Specifically, I am interested in understanding how Lambda's scaling could impact state management within my application.

To illustrate my concern: If I am processing a sequence of related messages from a single shard, and relying on the assumption that they will be handled by the same instance of my Lambda function, what risks or limitations should I be aware of?

Thank you! Best, Mikhail

Mikhail
asked 21 days ago · 536 views
1 Answer

Your assumption is incorrect. There is no affinity between a shard and a specific Lambda instance that handles that shard. As for scaling, there will be between 1 and 10 instances running concurrently for each shard, depending on the Parallelization Factor parameter configured on the event source mapping (a configuration sketch follows below).
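For context, the parallelization factor is set on the event source mapping itself. A minimal boto3 sketch of that configuration might look like the following; the function name, stream ARN, and factor value are placeholders rather than anything from your setup:

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical names and values, for illustration only.
response = lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    FunctionName="example-processor",
    StartingPosition="LATEST",
    # 1-10 concurrent batches (and thus Lambda instances) per shard; default is 1.
    ParallelizationFactor=2,
)
print(response["UUID"])
```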

To achieve what you want you have two options:

  1. When using a time (tumbling) window, you can return a state object from an invocation and receive it back in the next invocation for the same shard (see the first sketch after this list). Note that the state is re-initialized every time a new window starts, so this approach won't work if the data you need to process spans longer than the window.
  2. Maintain the state outside the function, e.g., in a DynamoDB table: read it at the start of each invocation and write it back at the end (see the second sketch below).
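To make option 1 concrete, here is a minimal handler sketch assuming a tumbling window is configured on the event source mapping; the "amount" field in the record payload is a hypothetical example, and the aggregation is just a running total:

```python
import base64
import json

def handler(event, context):
    # With a tumbling window, Lambda passes back the state returned by the
    # previous invocation for the same shard and window in event["state"].
    state = event.get("state") or {}
    total = state.get("total", 0)

    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # "amount" is a hypothetical field in the stream records.
        total += payload.get("amount", 0)

    if event.get("isFinalInvokeForWindow"):
        # The window is complete: persist or emit the aggregate here.
        print(f"Window total: {total}")
        return {"state": {}}

    # Return the updated state so the next invocation in this window sees it.
    return {"state": {"total": total}}
```

For option 2, a sketch of externalized state using a hypothetical DynamoDB table keyed by shard ID (the table name and attribute names are assumptions):

```python
import boto3

# Hypothetical table; it would need a partition key named "shard_id".
table = boto3.resource("dynamodb").Table("stream-processing-state")

def handler(event, context):
    # Each Kinesis record's eventID has the form "shardId-...:<sequenceNumber>",
    # so the shard ID can be recovered from it.
    shard_id = event["Records"][0]["eventID"].split(":")[0]

    item = table.get_item(Key={"shard_id": shard_id}).get("Item") or {}
    count = int(item.get("count", 0))

    count += len(event["Records"])

    # Write the state back so whichever instance handles the next batch
    # for this shard reads the up-to-date value.
    table.put_item(Item={"shard_id": shard_id, "count": count})
```

Note that with a parallelization factor above 1, more than one batch from the same shard can be processed concurrently, so a plain read-modify-write like this may need conditional writes or atomic counters in practice.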
Uri (AWS EXPERT)
answered 21 days ago
reviewed 21 days ago
