Hello.
How about using SQS as the target of an S3 event notification?
When an object is uploaded to S3, the object key is sent as a message to the SQS queue.
You can then implement the asynchronous processing by starting a Lambda function once a day with an EventBridge schedule and having it process the messages from the SQS queue.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html
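A rough sketch of that wiring with boto3 (the bucket and queue names are placeholders, and the queue's access policy must also allow S3 to send messages):

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Placeholder names -- replace with your own bucket and queue.
BUCKET = "my-upload-bucket"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/daily-uploads"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:daily-uploads"


def configure_bucket_notification():
    """One-time setup: send every ObjectCreated event from the bucket to the queue."""
    s3.put_bucket_notification_configuration(
        Bucket=BUCKET,
        NotificationConfiguration={
            "QueueConfigurations": [
                {"QueueArn": QUEUE_ARN, "Events": ["s3:ObjectCreated:*"]}
            ]
        },
    )


def process_object(key):
    """Placeholder for the real per-object processing."""
    print(f"processing {key}")


def lambda_handler(event, context):
    """Invoked once a day by an EventBridge schedule; drains the queue."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            # Each S3 notification message can carry several records
            # (and the initial s3:TestEvent carries none).
            for record in body.get("Records", []):
                process_object(record["s3"]["object"]["key"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```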
Hi,
In addition to Riku's proposal, I'd suggest using scheduled EventBridge rules to run your Lambda functions: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html#eb-schedule-create-rule
The above shows the console, but you can also do it programmatically with the Python SDK (boto3) through the ScheduleExpression parameter of the put_rule API.
I would break the 24 hours into one-minute slots (the finest granularity an EventBridge cron expression supports), use a random generator to assign a start slot to each user's Lambda execution, and convert it to the corresponding "cron(x y z a b c)" expression to spread the workload across the full day.
delete_rule() also exists (preceded by remove_targets()) so you can clean up rules when users disappear along the way.
Finally, you should schedule the "scheduler Lambda" that creates and deletes all the per-user rules to run once every day, to cope with new and deleted users.
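A rough sketch of one such per-user rule with boto3; the target Lambda ARN, the rule naming scheme, and the target Id are placeholders:

```python
import json
import random
import boto3

events = boto3.client("events")

# Placeholder ARN of the Lambda that processes a single user's data.
PROCESS_USER_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-user"


def schedule_user(user_id):
    """Create (or update) a rule that fires once a day at a random minute."""
    minute = random.randint(0, 59)
    hour = random.randint(0, 23)
    # EventBridge cron fields: minutes hours day-of-month month day-of-week year
    expression = f"cron({minute} {hour} * * ? *)"
    rule_name = f"daily-user-{user_id}"  # placeholder naming scheme
    events.put_rule(Name=rule_name, ScheduleExpression=expression, State="ENABLED")
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "process-user",
            "Arn": PROCESS_USER_LAMBDA_ARN,
            "Input": json.dumps({"user_id": user_id}),
        }],
    )


def unschedule_user(user_id):
    """Remove the rule when a user disappears (targets must be removed first)."""
    rule_name = f"daily-user-{user_id}"
    events.remove_targets(Rule=rule_name, Ids=["process-user"])
    events.delete_rule(Name=rule_name)
```

The target Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it, and EventBridge has a default quota on the number of rules per event bus, so tens of thousands of per-user rules may require a quota increase.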
Best,
Didier
There are many ways to implement that, but if you aren't too particular about the hour of day when the data is processed, a pretty straightforward approach could start with configuring S3 to produce an S3 Inventory report every day. That would produce all the raw data in a scalable, S3-native manner without any code: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html. You could enable EventBridge notifications for your S3 bucket to receive an event in the default EventBridge bus in the region when the inventory report is delivered, and launch the next step of the process immediately.
You mentioned tens of thousands of users, so I expect many orders of magnitude beyond that aren't required. Given that assumption, the next simple step, triggered by the S3 PutObject notification received via EventBridge (https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-notification.html), could be to extract the list of folders to process from the S3 inventory pointed to by the inventory manifest file. That shouldn't take long for a single Lambda invocation to do for tens or hundreds of thousands of users in a simple for loop, as you described. You could produce a CSV or JSON file of the list of folders in another S3 bucket.
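For illustration, a hedged sketch of such a Lambda, assuming CSV-formatted inventory files, an EventBridge rule that matches only the manifest.json key, and a placeholder destination bucket named folder-list-bucket:

```python
import csv
import gzip
import io
import json
from urllib.parse import unquote

import boto3

s3 = boto3.client("s3")

FOLDER_LIST_BUCKET = "folder-list-bucket"  # placeholder destination bucket


def lambda_handler(event, context):
    """Triggered by the EventBridge 'Object Created' event for manifest.json."""
    bucket = event["detail"]["bucket"]["name"]
    manifest_key = event["detail"]["object"]["key"]

    manifest = json.loads(
        s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read()
    )

    folders = set()
    # The manifest points at one or more gzipped CSV data files in the same bucket.
    for data_file in manifest["files"]:
        body = s3.get_object(Bucket=bucket, Key=data_file["key"])["Body"].read()
        with gzip.GzipFile(fileobj=io.BytesIO(body)) as gz:
            for row in csv.reader(io.TextIOWrapper(gz, encoding="utf-8")):
                # Assuming the default column order: bucket name, then object key
                # (keys are URL-encoded in the inventory CSV).
                object_key = unquote(row[1])
                folders.add(object_key.split("/", 1)[0])

    s3.put_object(
        Bucket=FOLDER_LIST_BUCKET,
        Key="folders.json",
        Body=json.dumps([{"folder": f} for f in sorted(folders)]).encode("utf-8"),
    )
```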
You could again catch the delivery of the folder list as an S3 PutObject event with EventBridge. As the next step, you could feed the list of folders to a Distributed Map state in Step Functions: https://docs.aws.amazon.com/step-functions/latest/dg/use-dist-map-orchestrate-large-scale-parallel-workloads.html. The distributed mode for the Map state takes a CSV or JSON file as input, and executes a given set of actions for each item. In this case, the action could be to trigger a Lambda function to extract data for a given user from the same S3 inventory report and to analyse it, generate graphs, send emails, and anything else you'd like to do per user. The Step Functions distributed map state would handle parallelisation for you, as well as giving you convenient tools to manage errors and retries at arbitrarily high scales.
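A sketch of that Map state, expressed as a boto3 create_state_machine call; every name, ARN, and the MaxConcurrency value below are placeholders:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN of the Lambda that processes one user's data.
PER_USER_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-user"

definition = {
    "StartAt": "ForEachUser",
    "States": {
        "ForEachUser": {
            "Type": "Map",
            # Read the items (one per folder) from the JSON file produced earlier.
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {"InputType": "JSON"},
                "Parameters": {"Bucket": "folder-list-bucket", "Key": "folders.json"},
            },
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "ProcessUser",
                "States": {
                    "ProcessUser": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {
                            "FunctionName": PER_USER_LAMBDA_ARN,
                            "Payload.$": "$",
                        },
                        "End": True,
                    }
                },
            },
            "MaxConcurrency": 100,
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="daily-per-user-processing",                   # placeholder
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-role",  # placeholder
)
```

The ItemReader reads the folders.json array produced in the previous step, and each item (for example {"folder": "user-123"}) becomes the payload of one worker invocation.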
If you don't want to use Step Functions, you could alternatively deliver the list of folders as individual, per-user items into an SQS queue, and set the queue to trigger your Lambda function to process an individual user's data.
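In that variant, the folder-list Lambda could enqueue one message per folder; a sketch, with a placeholder queue URL:

```python
import json
import boto3

sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/per-user-work"  # placeholder


def enqueue_folders(folders):
    """Send one message per user folder, in batches of 10 (the SQS maximum)."""
    for i in range(0, len(folders), 10):
        batch = folders[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"folder": folder})}
                for n, folder in enumerate(batch)
            ],
        )
```

An event source mapping from this queue to your per-user Lambda then takes care of invocation, batching, and retries.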
If you want more control over the timing of the process than you can get with the automated daily delivery of the S3 inventory, you could also list the folders in your own code with the ListObjectsV2 S3 API, using the / delimiter initially to obtain only a list of top-level folders, rather than the arbitrarily large numbers of objects in the bucket. You could again feed that list of folders to either Step Functions or SQS, which would invoke a Lambda function for every folder. In this approach, each of those Lambda invocations would have to pull a list of objects to analyse starting with the given user's folder as a prefix, rather than being able to read the data from the S3 inventory report. Note that the ListObjectsV2 calls you'd be making for each user are billable API calls and take some time to complete, and you'll have to take care of handling timeouts, errors, and so on that will likely happen with a growing number of folders. All those downsides could be avoided in a highly scalable and cost-efficient manner with the S3 inventory option.
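A short sketch of that initial listing, again with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")


def list_top_level_folders(bucket):
    """Return the top-level 'folders' using the / delimiter, handling pagination."""
    folders = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        # With a delimiter, folder-like groupings come back as CommonPrefixes.
        folders.extend(p["Prefix"] for p in page.get("CommonPrefixes", []))
    return folders


print(list_top_level_folders("my-upload-bucket"))  # placeholder bucket name
```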