Combine streams from S3 object listing and ObjectCreated events

I am hoping to create a stream of messages. The stream will be defined by an S3 bucket + prefix, and it will emit:

  1. a list of all existing objects at the given prefix
  2. new objects that are uploaded under the prefix, via the ObjectCreated event

The stream will need to handle deduplication, so that if the same object shows up in both 1 and 2 (because of the asynchronous processing), it is processed exactly once.
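
To illustrate the kind of deduplication I have in mind, a conditional write into a tracking table seems like it would work. Below is only a rough sketch: the DynamoDB table `processed-objects` and its `object_key` attribute are placeholder names, not anything I have built yet.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Placeholder table: partition key "object_key" (string)
table = dynamodb.Table("processed-objects")

def claim_object(bucket: str, key: str) -> bool:
    """Atomically record the object; return True only for the first caller."""
    try:
        table.put_item(
            Item={"object_key": f"{bucket}/{key}"},
            # Fail if another worker (listing path or event path) already recorded it
            ConditionExpression="attribute_not_exists(object_key)",
        )
        return True   # first sighting: safe to process
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # already processed via the other path
```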

I need to be able to create new instances of the stream at any time and direct them to destinations such as Lambda or SQS. For example, when I change the code that processes the files to produce a new version of the output, I will create a new Lambda and replay the stream from the beginning (preferring newer files first).
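
For the replay part, here is a rough sketch of what I am imagining (the queue URL and message format are placeholders): list everything under the prefix, sort newest first, and feed the keys into the same SQS queue that the ObjectCreated notifications would go to.

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def replay_prefix(bucket: str, prefix: str, queue_url: str) -> None:
    """Re-enqueue every existing object under the prefix, newest first."""
    objects = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects.extend(page.get("Contents", []))

    # Prefer newer files first when replaying
    objects.sort(key=lambda o: o["LastModified"], reverse=True)

    for obj in objects:
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"bucket": bucket, "key": obj["Key"]}),
        )
```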

Currently I am using Databricks Auto Loader, but I would like to move the solution back to plain AWS to have more control over the implementation. How can I build this functionality on AWS? Thanks!

For context, I have also discussed this solution in the Databricks forums: https://community.databricks.com/s/question/0D58Y000091yskKSAQ/help-integrating-streaming-pipeline-with-aws-services-s3-and-lambda
