How do you concurrently run a function that generates an output for each user's items in an S3 bucket?


I am looking for some insight into which services I would use to implement the following scenario.

I have a collection of users, all of whose data is contained in an S3 bucket. Data is collected daily from each user's account.

I would like to implement a system where the daily data of each user is analysed and the outputs are given to them. Essentially, what I imagine is a CloudWatch Events rule that fires every 24 hours and triggers a Lambda function that analyses each user's data and produces an output, e.g. a graph or an image.

However, I can't seem to grasp how I can asynchronously run this Lambda function for each user's data. If the data is analysed every 24 hours, the output should be generated for all users at around the same time. For 5-10 users, running through them sequentially (which I can do) is fine, as the outputs wouldn't be far apart in time, but as the user count scales to 100, 1,000, 10,000 and so forth, it doesn't make sense to run a single function in a for loop.

The core question is: how do I run a Lambda function asynchronously for many users, all of whose data is stored in separate folders in S3?

3 Answers

Hello.

How about using SQS as the target of an S3 event trigger?
When an object is uploaded to S3, the object key is sent as a message to the SQS queue.
Asynchronous processing can then be implemented by starting a Lambda function once a day with the EventBridge Scheduler and processing the messages from the SQS queue.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html
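A minimal sketch of the daily drain step, assuming a standard SQS queue subscribed to S3 event notifications (the queue URL and client wiring are placeholders, and message parsing assumes the standard S3 notification JSON):

```python
import json


def keys_from_message(body: str) -> list:
    """Extract S3 object keys from an S3 event notification message body."""
    event = json.loads(body)
    return [rec["s3"]["object"]["key"] for rec in event.get("Records", [])]


def drain_queue(sqs, queue_url: str) -> list:
    """Read all pending messages from the queue and return the object keys.

    Intended to run inside a Lambda function triggered once a day by the
    EventBridge Scheduler; each key can then be analysed per user.
    """
    keys = []
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if not messages:
            return keys
        for msg in messages:
            keys.extend(keys_from_message(msg["Body"]))
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

In practice you would pass `boto3.client("sqs")` as the `sqs` argument, and you'd likely want a dead-letter queue on the source queue for messages that repeatedly fail.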

answered 13 days ago

Hi,

In addition to Riku's proposal, I'd suggest using Lambda scheduling via EventBridge: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html#eb-schedule-create-rule

The above is done with the console, but you can also do it programmatically with the Python SDK (boto3) via the ScheduleExpression parameter of the put_rule API.

I would break the 24 hours into slots (EventBridge cron expressions have minute granularity, so up to 1,440 slots per day), use a random generator to assign a start slot to each user's Lambda execution, and convert it to the corresponding "cron(x y z a b c)" expression to spread the workload across the full day.

delete_rule() also exists, so you can manage your rules if some users disappear along the way.

Finally, you should run the "scheduler Lambda" (the one that schedules all the others) once every day, to cope with new and deleted users.
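A rough sketch of the spreading idea, assuming per-user rule names and a Lambda ARN that are made up for illustration (put_rule and put_targets are the real EventBridge APIs; everything else is a placeholder):

```python
import json
import random


def random_daily_cron(rng=None) -> str:
    """Pick a random minute of the day and return a daily EventBridge cron.

    EventBridge cron expressions have minute granularity, giving 1,440
    possible slots per day: cron(M H * * ? *).
    """
    rng = rng or random.Random()
    minute_of_day = rng.randrange(24 * 60)
    hour, minute = divmod(minute_of_day, 60)
    return f"cron({minute} {hour} * * ? *)"


def schedule_user_rule(events, user_id: str, lambda_arn: str) -> str:
    """Create or update a per-user daily rule pointing at the analysis Lambda."""
    rule_name = f"daily-analysis-{user_id}"  # hypothetical naming scheme
    events.put_rule(Name=rule_name, ScheduleExpression=random_daily_cron())
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "1",
            "Arn": lambda_arn,
            "Input": json.dumps({"user_id": user_id}),
        }],
    )
    return rule_name
```

Here `events` would be `boto3.client("events")`, and the scheduler Lambda would call `schedule_user_rule` for each new user and `delete_rule` for removed ones.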

Best,

Didier

answered 13 days ago

There are many ways to implement that, but if you aren't too particular about the hour of day when the data is processed, a pretty straightforward approach could start with configuring S3 to produce an S3 Inventory report every day. That would produce all the raw data in a scalable, S3-native manner without any code: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html. You could enable EventBridge notifications for your S3 bucket to receive an event in the default EventBridge bus in the region when the inventory report is delivered, and launch the next step of the process immediately.

You mentioned tens of thousands of users, so I expect many orders of magnitude beyond that aren't required. Given that assumption, the next simple step, triggered by the S3 PutObject notification received via EventBridge (https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-notification.html), could be to extract the list of folders to process from the S3 inventory pointed to by the inventory manifest file. That shouldn't take long for a single Lambda invocation to do for tens or hundreds of thousands of users in a simple for loop, as you described. You could produce a CSV or JSON file of the list of folders in another S3 bucket.
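A sketch of that extraction step, assuming a CSV-format inventory whose second column is the object key (column order depends on the fields you configured; keys in the inventory may also be URL-encoded, which is omitted here):

```python
import csv
import gzip
import io
import json


def top_level_folders(keys) -> list:
    """Collapse keys like 'user123/2024-05-01/data.json' to the sorted set
    of top-level folder names."""
    return sorted({key.split("/", 1)[0] for key in keys if "/" in key})


def folders_from_inventory(s3, bucket: str, manifest_key: str) -> list:
    """Read an S3 Inventory manifest and return the top-level folders.

    The manifest's 'files' entries point at gzip-compressed CSV data files
    containing one row per object in the bucket.
    """
    manifest = json.loads(
        s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read()
    )
    keys = []
    for data_file in manifest["files"]:
        body = s3.get_object(Bucket=bucket, Key=data_file["key"])["Body"].read()
        with gzip.open(io.BytesIO(body), "rt") as fh:
            for row in csv.reader(fh):
                keys.append(row[1])  # second column holds the object key
    return top_level_folders(keys)
```

The resulting list could then be written as the JSON or CSV file mentioned above.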

You could again catch the delivery of the folder list as an S3 PutObject event with EventBridge. As the next step, you could feed the list of folders to a Distributed Map state in Step Functions: https://docs.aws.amazon.com/step-functions/latest/dg/use-dist-map-orchestrate-large-scale-parallel-workloads.html. The distributed mode for the Map state takes a CSV or JSON file as input, and executes a given set of actions for each item. In this case, the action could be to trigger a Lambda function to extract data for a given user from the same S3 inventory report and to analyse it, generate graphs, send emails, and anything else you'd like to do per user. The Step Functions distributed map state would handle parallelisation for you, as well as giving you convenient tools to manage errors and retries at arbitrarily high scales.
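As a rough sketch of the state machine (the bucket, key, and function names are made-up placeholders, and the folder-list file is assumed to be a JSON array of objects with a "folder" field):

```json
{
  "Comment": "Sketch: fan out per-user analysis over a folder list",
  "StartAt": "AnalyseEachUser",
  "States": {
    "AnalyseEachUser": {
      "Type": "Map",
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": { "InputType": "JSON" },
        "Parameters": { "Bucket": "folder-list-bucket", "Key": "folders.json" }
      },
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
        "StartAt": "AnalyseUser",
        "States": {
          "AnalyseUser": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "FunctionName": "analyse-user-data",
              "Payload": { "folder.$": "$.folder" }
            },
            "End": true
          }
        }
      },
      "MaxConcurrency": 100,
      "End": true
    }
  }
}
```

MaxConcurrency caps the parallel Lambda invocations so you stay within your account's concurrency limits.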

If you don't want to use Step Functions, you could alternatively deliver the list of folders as individual, per-user items into an SQS queue, and set the queue to trigger your Lambda function to process an individual user's data.
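The SQS variant could be sketched like this, with the queue URL as a placeholder (SendMessageBatch accepts at most 10 entries per call, hence the chunking):

```python
import json


def batch(items, size=10):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def enqueue_folders(sqs, queue_url: str, folders) -> None:
    """Send one SQS message per user folder, so the queue-triggered Lambda
    processes each user's data independently."""
    for chunk in batch(folders):
        entries = [
            {"Id": str(i), "MessageBody": json.dumps({"folder": folder})}
            for i, folder in enumerate(chunk)
        ]
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

With the queue configured as an event source for the Lambda function, SQS then drives the per-user fan-out and retries for you.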

If you want more control over the timing of the process than you can get with the automated daily delivery of the S3 inventory, you could also list the folders in your own code with the ListObjectsV2 S3 API, using the / separator initially to obtain only a list of top-level folders, rather than the arbitrarily large numbers of objects in the bucket. You could again feed that list of folders to either Step Functions or SQS, which would invoke a Lambda function for every folder. In this approach, each of those Lambda invocations would have to pull a list of objects to analyse starting with the given user's folder as a prefix, rather than being able to read the data from the S3 inventory report. Note that the ListObjectsV2 calls you'd be making for each user are billable API calls and take some time to complete, and you'll have to take care of handling timeouts, errors, and so on that will likely happen with a growing number of folders. All those downsides could be avoided in a highly scalable and cost-efficient manner with the S3 inventory option.
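A sketch of that delimiter-based listing, using the real ListObjectsV2 paginator; top-level folders come back as CommonPrefixes rather than as objects:

```python
def folders_from_pages(pages) -> list:
    """Collect top-level folder names from ListObjectsV2 response pages
    that were issued with Delimiter='/'."""
    folders = []
    for page in pages:
        for common_prefix in page.get("CommonPrefixes", []):
            folders.append(common_prefix["Prefix"].rstrip("/"))
    return folders


def list_top_level_folders(s3, bucket: str) -> list:
    """List only the top-level 'folders' in the bucket, paginating as needed."""
    paginator = s3.get_paginator("list_objects_v2")
    return folders_from_pages(paginator.paginate(Bucket=bucket, Delimiter="/"))
```

Each per-user Lambda invocation would then call ListObjectsV2 again with its folder name as the Prefix to find that user's objects.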

Leo K
answered 13 days ago
