Uniqueness checks when streaming data via Firehose to S3

Hi there! I'm streaming a bunch of JSON data into an S3 bucket via Firehose. I was wondering what the best practice is for uniqueness checks on those individual JSON objects, since Firehose aggregates them into one S3 object. (If Firehose didn't aggregate, I guess I could identify the uniqueness of the objects by their respective keys in the bucket.) Thanks for reading!

2 Answers
Greetings,

Lambda functions can be a good choice for processing data in a streaming pipeline, but Lambda's concurrency limits need to be taken into account when designing it.

Lambda functions can be invoked in response to events, such as records arriving on an Amazon Kinesis stream. Each invocation processes one event (or one batch of records), and the concurrency limit determines how many invocations can run at the same time. If the pipeline produces more events than the concurrency limit allows, some events have to wait until an execution environment becomes free.

To mitigate this issue, the pipeline can be designed to use multiple instances of Lambda functions or to batch events before invoking Lambda functions, as in the sketch below. Additionally, the pipeline can use other AWS services, such as Amazon Kinesis Data Analytics or Amazon EMR, which can handle larger volumes of data.
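For illustration, a minimal sketch of the batching approach, assuming a Kinesis stream as the event source (the stream ARN and function name are placeholders):

```python
import boto3

# Batch records from the Kinesis stream so each Lambda invocation handles
# up to 500 records at once, reducing how many concurrent executions are
# needed to keep up with the incoming event rate.
lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    FunctionName="example-processor",  # placeholder function name
    BatchSize=500,                     # records delivered per invocation
    StartingPosition="LATEST",
)
```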

Overall, Lambda functions can be a good choice for processing data in a streaming pipeline, provided the design accounts for Lambda's concurrency limits so the pipeline can handle the expected data volume. Please let me know if that helps answer your question.

AWS
EXPERT
ZJon
answered a year ago
  • Hey! Thanks again for a thorough answer! It definitely helped. I don't understand what you mean by "designing a pipeline to use multiple instances of Lambda functions", though. What is a multiple instance of a Lambda function?

  • Greetings,

    When we talk about multiple instances of a Lambda function, we mean that there are multiple copies of the same function running concurrently to process incoming events.

    AWS Lambda automatically scales the number of instances of a function based on the incoming event rate. Each instance of the function runs independently and can process a single event at a time.

    Designing a pipeline to use multiple instances of Lambda functions involves breaking the pipeline into smaller, independent stages and giving each stage its own Lambda function. Lambda then scales each function out across multiple instances, allowing events to be processed in parallel, which can improve throughput and reduce latency.

    For example, in a streaming pipeline that ingests, processes, and stores data, you might design the pipeline to use multiple instances of a Lambda function to handle the processing stage. Each instance of the function would process a subset of the incoming events in parallel, and the output of each instance could be combined or aggregated in a downstream service such as Amazon S3 or Amazon DynamoDB.

    By using multiple instances of a Lambda function in this way, you can achieve greater scalability, resilience, and cost efficiency in your pipeline architecture.
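    As an illustration, here is a minimal sketch of reserving concurrency for one stage, so a busy stage cannot consume the whole account-level limit (the function name is a placeholder):

    ```python
    import boto3

    # Reserve part of the account's concurrency for the processing stage;
    # this guarantees the function up to 200 concurrent instances and caps
    # it there, protecting the other stages.
    lambda_client = boto3.client("lambda")

    lambda_client.put_function_concurrency(
        FunctionName="processing-stage",   # hypothetical stage function
        ReservedConcurrentExecutions=200,  # max instances for this function
    )
    ```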

    Please let me know if I answered your question.

  • Hey Zokir! Thanks again for answering! I don't understand: if "Lambda concurrency limits" refers to the maximum number of concurrent Lambdas you can run at a time (3,000 or something) and "multiple instances of a Lambda function" refers to automatically scaling the number of instances of a function, aren't these the same thing? How is the latter a solution to the former (as suggested in your first answer)? Thanks for your patience! Feel free to link me any documentation you think I'm missing, btw :)

  • Lambda concurrency limits refer to the maximum concurrent executions allowed for all Lambda functions in an AWS account within a region. Multiple instances of a Lambda function refer to the automatic scaling of a single Lambda function to handle multiple requests concurrently.

    To address Lambda concurrency limits, you can request an increase in the concurrency limit from AWS, allowing more instances of your Lambda functions to run simultaneously without hitting the account-level limit. The two concepts are related but address different aspects of Lambda usage. AWS provides a good overview of Lambda concurrency and how to manage it in the documentation: https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
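    For reference, a hedged sketch of requesting the increase programmatically through Service Quotas; the quota code below is an assumption, so verify the current code for "Concurrent executions" in the Service Quotas console:

    ```python
    import boto3

    # Request a higher account-level Lambda concurrency quota.
    quotas = boto3.client("service-quotas")

    response = quotas.request_service_quota_increase(
        ServiceCode="lambda",
        QuotaCode="L-B99A9384",  # assumed code for "Concurrent executions"
        DesiredValue=5000.0,
    )
    print(response["RequestedQuota"]["Status"])  # e.g. PENDING
    ```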

    I hope this clears up any confusion. If you have further questions, feel free to ask.

  • Thanks!! This is tons of info for now :)

Greetings,

When streaming a bunch of JSON data into an S3 bucket via Firehose, it can be challenging to enforce uniqueness checks on individual JSON objects since Firehose aggregates them into one S3 object. However, there are a few options you can consider:

Include a unique identifier in each JSON object: One solution is to add a unique identifier, such as a UUID or timestamp, to each JSON object before sending it to Firehose. Even after the objects are aggregated into a single S3 object, you can still identify each one by its identifier.
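A minimal sketch of this approach, assuming a delivery stream named "example-stream" (a placeholder):

```python
import json
import uuid
import boto3

# Tag each JSON object with a UUID before sending it to Firehose, so
# individual records stay identifiable after Firehose aggregates them
# into a single S3 object.
firehose = boto3.client("firehose")

def send_record(payload: dict) -> None:
    payload["record_id"] = str(uuid.uuid4())  # unique per object
    firehose.put_record(
        DeliveryStreamName="example-stream",  # placeholder stream name
        Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
    )

send_record({"event": "click", "user": "alice"})
```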

Use an AWS service for deduplication: You can put a streaming service such as Amazon Kinesis Data Streams or Apache Kafka (for example, Amazon MSK) in front of Firehose and deduplicate there. Kafka supports idempotent producers and exactly-once semantics, and with Kinesis Data Streams a consumer application can drop duplicates based on a unique identifier before forwarding records to Firehose.

Use a Lambda function to process the data: Another option is to run each record through a Lambda function, either before it is sent to Firehose or as a Firehose data-transformation function attached to the delivery stream. The function can check each JSON object for uniqueness and discard duplicates, so that only unique objects are delivered.
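Here is a hedged sketch of the transformation approach. It assumes each record carries the "record_id" field from the first option, and that a DynamoDB table named "seen-records" (a placeholder) with partition key "record_id" exists to track what has already been delivered:

```python
import base64
import json
import boto3
from botocore.exceptions import ClientError

# Firehose data-transformation handler that drops duplicate records.
table = boto3.resource("dynamodb").Table("seen-records")  # placeholder table

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        try:
            # Conditional write fails if this id has been seen before.
            table.put_item(
                Item={"record_id": payload["record_id"]},
                ConditionExpression="attribute_not_exists(record_id)",
            )
            result = "Ok"
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                result = "Dropped"  # duplicate: Firehose discards it
            else:
                raise
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],  # pass the payload through unchanged
        })
    return {"records": output}
```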

It's important to note that enforcing uniqueness checks on individual JSON objects can add some overhead and complexity to your pipeline. Therefore, you should consider the trade-offs and choose a solution that meets your requirements while minimizing the impact on performance and cost.

Please let me know if I answered your question or if you need more help.

AWS
EXPERT
ZJon
answered a year ago
  • Hey Zokir! Thank you for the thorough answer! It does indeed confirm what I understood of the problem. If I may ask a follow-up: given a context where large amounts of data are being streamed, and given the concurrency limits of Lambda, are Lambdas actually a good idea as part of a streaming pipeline?
