
AWS Kinesis Data Streams with Lambda


Hello!

I have a system design / architecture problem.

Context: The ultimate goal is to ingest battery data from an MQTT broker -> process this data to track each data point's state (charging, discharging, or idle) for each battery -> send it to a data bucket -> extract from the bucket the latest valid segment of accumulated battery data that was in the charging state -> send it to an algorithm on SageMaker. Our goal is an efficient, cost-saving approach that gets a finished product to the customer as soon as possible.

Numbers: I am receiving a data point every 20 seconds from 150 separate devices, each around 500 B, via IoT Core. For the latter half of the architecture, I only need to query the data bucket to run the data through the SageMaker algorithm ONCE a month, so I don't really need real-time processing. I want to reduce cost as much as possible. Also, after doing the calculations, all the data fits in one Kinesis shard with a maximum retention of 1 month.
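For what it's worth, the shard calculation mentioned above can be checked with a few lines of arithmetic (using the published per-shard write limits of 1 MB/s and 1,000 records/s):

```python
# Back-of-the-envelope check that one Kinesis shard suffices.
DEVICES = 150
INTERVAL_S = 20        # one data point every 20 seconds per device
RECORD_BYTES = 500

records_per_sec = DEVICES / INTERVAL_S           # 7.5 records/s
bytes_per_sec = records_per_sec * RECORD_BYTES   # 3,750 B/s

SHARD_INGEST_BYTES = 1_000_000    # 1 MB/s per-shard write limit
SHARD_INGEST_RECORDS = 1_000      # 1,000 records/s per-shard write limit

assert records_per_sec < SHARD_INGEST_RECORDS
assert bytes_per_sec < SHARD_INGEST_BYTES

# Total data accumulated over a 30-day retention window:
monthly_gb = bytes_per_sec * 86_400 * 30 / 1e9
print(f"{monthly_gb:.1f} GB/month")   # prints "9.7 GB/month"
```

So ingest is roughly 0.4% of one shard's write capacity, and a month of raw data is under 10 GB.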

Current approach: Ingest data in IoT Core -> send it to Kinesis Data Streams -> PROBLEM (explained below) -> send the processed data to S3 -> query S3 with Athena -> send the queried data to SageMaker.

PROBLEM: My FIRST approach was to use Amazon Managed Service for Apache Flink. However, it is a fairly complex system, and version migrations make it confusing which features are usable. I want to stick to Python, and it seems PyFlink has not yet been integrated with the new version. I believe Flink would track state and assign segment IDs (a segment being the batched data that makes up one state) seamlessly, but learning this tech stack would take time.

My SECOND approach is to use Lambda to process batched data, track each battery's state in DynamoDB, and add/process data according to these states over time before sending it to S3. This means I would let data accumulate in Kinesis Data Streams (possibly for a day), extract it from the stream with Lambda for each battery separately, process it by adding state fields and a segment ID, then send it to S3. However, I do not know whether this is feasible.
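For reference, the state/segment bookkeeping I have in mind could be sketched like this (the field name `current_a` and the idle threshold are placeholders, not my actual schema):

```python
# Hypothetical state classifier and segment assigner. A "segment" here is
# one contiguous run of readings that share the same state.
IDLE_THRESHOLD_A = 0.1   # assumed dead band around zero current

def classify(current_a: float) -> str:
    """Map a battery current reading to a charge state."""
    if current_a > IDLE_THRESHOLD_A:
        return "charging"
    if current_a < -IDLE_THRESHOLD_A:
        return "discharging"
    return "idle"

def assign_segments(readings, start_segment=0):
    """Tag each reading with a state and a segment ID that increments
    whenever the state changes."""
    segment, prev_state = start_segment, None
    out = []
    for r in readings:
        state = classify(r["current_a"])
        if prev_state is not None and state != prev_state:
            segment += 1
        out.append({**r, "state": state, "segment_id": segment})
        prev_state = state
    return out
```

The open question is whether this per-battery bookkeeping scales cleanly inside Lambda with DynamoDB holding the carry-over state between invocations.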

I am open to any suggestions (completely different architecture), advice, or any questions that could clarify my problem.

1 Answer

Based on your requirements and constraints, using AWS Lambda with Kinesis Data Streams can be an effective and cost-efficient solution for your battery data processing system. Here's how you can approach this:

  1. Data Ingestion: Continue using IoT Core to ingest data from your MQTT broker and send it to Kinesis Data Streams.

  2. Data Processing with Lambda:

    • Configure Lambda to process records from Kinesis Data Streams using a standard iterator. This is more cost-effective for your use case since you don't need real-time processing.
    • Set up a batching window in Lambda to buffer records for up to 5 minutes before invoking the function. This can help reduce the number of Lambda invocations and associated costs.
    • Implement a Lambda function that processes the batched data, keeping track of each battery's state (charging, discharging, idle) and assigning segment IDs.
  3. State Management:

    • Use DynamoDB to store the current state and metadata for each battery. This approach is simpler and more cost-effective than using Apache Flink for your requirements.
    • In your Lambda function, read the current state from DynamoDB, update it based on the new data, and write the updated state back to DynamoDB.
  4. Data Storage:

    • After processing the data in Lambda, send the enriched data (including state information and segment IDs) to S3.
    • Organize the data in S3 using a partitioning scheme that facilitates efficient querying (e.g., by date and battery ID).
  5. Monthly Processing:

    • Since you only need to query the data once a month, you can use AWS Glue or Athena to run your monthly queries on the data in S3.
    • Extract the required data (latest valid segment of accumulated battery data in charging state) and send it to SageMaker for processing.
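Steps 2–4 above can be sketched as a single Lambda handler. Everything here is illustrative: the table name `battery-state`, the bucket `battery-data-processed`, and the payload fields (`device_id`, `current_a`) are assumptions, not prescriptions.

```python
import base64
import json
from datetime import datetime, timezone

def decode_records(event):
    """Kinesis delivers payloads base64-encoded; decode each to a dict."""
    return [
        json.loads(base64.b64decode(r["kinesis"]["data"]))
        for r in event.get("Records", [])
    ]

def handler(event, context):
    import boto3  # deferred so the pure helpers above are testable offline

    table = boto3.resource("dynamodb").Table("battery-state")
    s3 = boto3.client("s3")

    enriched = []
    for point in decode_records(event):
        device_id = point["device_id"]

        # Step 3: read this battery's last known state/segment.
        prev = table.get_item(Key={"device_id": device_id}).get("Item", {})
        current = point["current_a"]
        state = ("charging" if current > 0
                 else "discharging" if current < 0 else "idle")
        segment = prev.get("segment_id", 0) + (state != prev.get("state"))

        table.put_item(Item={"device_id": device_id,
                             "state": state, "segment_id": segment})
        enriched.append({**point, "state": state, "segment_id": segment})

    # Step 4: write the enriched batch to S3, partitioned by date for Athena.
    now = datetime.now(timezone.utc)
    key = (f"battery/year={now:%Y}/month={now:%m}/day={now:%d}/"
           f"{context.aws_request_id}.json")
    s3.put_object(Bucket="battery-data-processed", Key=key,
                  Body="\n".join(json.dumps(p) for p in enriched))
    return {"processed": len(enriched)}
```

In practice you would batch the DynamoDB reads/writes per device rather than per record, but the shape of the function is the same.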

This approach leverages Lambda's ability to process Kinesis Data Streams efficiently while using DynamoDB for state management. It's simpler to implement than Apache Flink, especially if you're more comfortable with Python. The use of batching and standard iterators in Lambda helps reduce costs, and the overall architecture meets your requirements for efficient, cost-effective processing without the need for real-time analysis.
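The batching configuration is a one-time piece of wiring. A sketch with boto3 follows; the function name, stream ARN, and account ID are placeholders, while `BatchSize` and the 300-second batching window are the documented maximums for Kinesis event sources.

```python
# Event source mapping parameters for the Kinesis -> Lambda trigger.
MAPPING_PARAMS = {
    "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/battery-stream",
    "FunctionName": "battery-processor",
    "StartingPosition": "TRIM_HORIZON",
    "BatchSize": 10_000,                      # max records per invocation
    "MaximumBatchingWindowInSeconds": 300,    # buffer for up to 5 minutes
    "BisectBatchOnFunctionError": True,       # split failing batches on retry
    "MaximumRetryAttempts": 3,
}

def create_mapping(params=MAPPING_PARAMS):
    """Create the mapping once at deploy time."""
    import boto3  # deferred so the parameter dict can be inspected offline
    return boto3.client("lambda").create_event_source_mapping(**params)
```

With ~7.5 records/s arriving, a 5-minute window means each invocation sees a couple of thousand records at most, so a handful of invocations per hour rather than one per record.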

Remember to configure your Lambda function with appropriate memory and timeout settings to handle the data volume efficiently. Also, consider implementing error handling and retry mechanisms to ensure data integrity and processing reliability.

Sources
Process Amazon Kinesis Data Streams records with Lambda - AWS Lambda
How Lambda processes records from Amazon Kinesis Data Streams - AWS Lambda
Develop consumers using AWS Lambda - Amazon Kinesis Data Streams

answered a year ago
EXPERT
reviewed a year ago

