
Best strategy to update a csv file at the end of a Lambda function


I would like to use Lambda to analyze 10,000 images. Lambda should produce the analysis result for one image as one row of numbers, for example [374,782,54,22,555,654,901,22,45]. The function will be invoked 10,000 times, with maximal concurrency to speed things up.

I want the end result to be a single CSV file with 10,000 rows. What is the best way to do this? As far as I know, Lambda layers are read-only, so they are not an option. Of course I could create a new CSV file and overwrite the old one in S3 10,000 times, but I guess that would incur unnecessary time and cost (and maybe conflict with the concurrency). I could also create 10,000 single-line CSV files in S3, download them to my computer, and patch them together myself, but I guess there are more elegant solutions.
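For concreteness, here is a sketch of what each invocation could write: one row serialized as a one-line CSV, stored under a unique key so that 10,000 concurrent writers never collide (the bucket name and `results/parts/` prefix are made up):

```python
import csv
import io
import uuid

def row_to_csv(row):
    """Serialize one analysis result (a list of numbers) as a one-line CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()

def part_key(prefix="results/parts/"):
    """A unique object key per invocation, so concurrent writers never collide."""
    return f"{prefix}{uuid.uuid4()}.csv"

# Inside the Lambda handler you would upload the line with boto3, e.g.:
#   boto3.client("s3").put_object(Bucket="my-bucket", Key=part_key(),
#                                 Body=row_to_csv(result))
```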

asked 2 years ago · 780 views
3 Answers

Hello.

Of course I could create a new CSV file and overwrite the old one in S3 for 10000 times. But I guess that would incur unnecessary time and cost?

Since S3 charges based on the number of requests, this type of operation may result in high costs, depending on how many times this process is performed per day.
https://aws.amazon.com/s3/pricing/?nc1=h_ls

Another option is to temporarily store data using DynamoDB, but this may be more expensive than S3.
https://aws.amazon.com/dynamodb/pricing/?nc1=h_ls

Therefore, I think it would be a good idea to save each result to S3 and then merge them into a single CSV later as a batch process.
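The "merge later" step could look something like this. The merge itself is trivial string concatenation; the boto3 listing and fetching are shown in comments, and the bucket name and prefix are assumptions:

```python
def merge_parts(parts):
    """Concatenate many one-line CSV strings into a single CSV document,
    normalizing trailing newlines along the way."""
    return "".join(p if p.endswith("\n") else p + "\n" for p in parts)

# In the batch job you would list and fetch the parts with boto3, e.g.:
#   s3 = boto3.client("s3")
#   pages = s3.get_paginator("list_objects_v2").paginate(
#       Bucket="my-bucket", Prefix="results/parts/")
#   keys = [o["Key"] for page in pages for o in page.get("Contents", [])]
#   parts = [s3.get_object(Bucket="my-bucket", Key=k)["Body"].read().decode()
#            for k in keys]
#   s3.put_object(Bucket="my-bucket", Key="results/all.csv",
#                 Body=merge_parts(parts))
```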

EXPERT
answered 2 years ago · reviewed 2 years ago

I could just create 10000 single-line CSV files in S3, download them to computer, and I patch them together myself

A variation of this would be for each invocation of your function to write a one-line CSV to a different area of the bucket, with an S3 Event Notification that triggers another Lambda function to append the contents of each new file to the "main" CSV.
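A sketch of that appender, assuming the standard S3 Event Notification payload shape (the object keys here are hypothetical). Note that S3 objects cannot be appended to in place, so "append" means read-modify-write, which is only safe if the appends are serialized:

```python
def keys_from_s3_event(event):
    """Pull (bucket, key) pairs out of an S3 Event Notification payload."""
    return [(r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
            for r in event.get("Records", [])]

# The appender Lambda would then read-modify-write the main CSV:
#   for bucket, key in keys_from_s3_event(event):
#       main = s3.get_object(Bucket=bucket, Key="results/all.csv")["Body"].read()
#       line = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
#       s3.put_object(Bucket=bucket, Key="results/all.csv", Body=main + line)
```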

An even better solution, although more complicated and possibly more expensive, is to put these S3 Events into a Kinesis stream and have the stream trigger the Lambda function that appends to the main CSV file, so the appends are serialized rather than racing each other (not my idea; credit to https://stackoverflow.com/a/42693053 ).

Or depending on how frequently the original function is run, and how up-to-date the main CSV file must be kept, instead of S3 Event Notification you could use EventBridge Scheduler to run a function that does a sweep of all the one-line CSVs every minute (or whatever it needs to be) and then does a bulk append into the main CSV.
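The sweep has one extra wrinkle worth sketching: after a bulk append, the merged part objects should be deleted so the next sweep does not re-append them, and S3's DeleteObjects API accepts at most 1,000 keys per call. A minimal sketch (key names are hypothetical):

```python
def sweep(main_body, parts):
    """Bulk-append a batch of one-line CSV bodies to the current main CSV body."""
    return main_body + "".join(p if p.endswith("\n") else p + "\n" for p in parts)

def chunk_keys(keys, size=1000):
    """S3 DeleteObjects accepts at most 1,000 keys per call, so batch them."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]

# The scheduled Lambda would list everything under the parts prefix, write
# sweep(main, parts) back to results/all.csv, then delete the merged parts:
#   for batch in chunk_keys(keys):
#       s3.delete_objects(Bucket="my-bucket",
#                         Delete={"Objects": [{"Key": k} for k in batch]})
```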

EXPERT
answered 2 years ago · reviewed 2 years ago

I would recommend using Step Functions with a Distributed Map state. The Map state will iterate over the image files in S3 and process each one with a Lambda function that generates one row. After the Map state, a single Lambda function can collect all the results and create one CSV file in S3.
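The collector after the Map state might look like this; the `"results"` field name is an assumption about how the state machine passes the Map output along:

```python
import csv
import io

def collect_results(event):
    """Turn the Map state's output (a list of per-image rows) into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in event["results"]:
        writer.writerow(row)
    return buf.getvalue()

# The collector Lambda would then do a single PutObject:
#   s3.put_object(Bucket="my-bucket", Key="results/all.csv",
#                 Body=collect_results(event))
```

One caveat: with 10,000 iterations the aggregate Map output can exceed Step Functions' 256 KB payload limit, in which case Distributed Map's ResultWriter can be used to write the results to S3 instead of passing them inline.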

AWS EXPERT
answered 2 years ago
