Unzipping files from S3 bucket

Dear Team,

We ingest a bunch of files into an AWS S3 bucket in zip format on a daily basis. We need to unzip them for further processing. What is the best (most optimized) way to unzip them?

Would unzipping through Lambda and uploading back to another S3 bucket be the better option, or is unzipping to an EFS file system better?

Note: After unzipping, we need to upload the unzipped files to a target S3 bucket with a different folder structure.

Regards, Dhaval Mehta

2 Answers

Hello. We do much the same, although not strictly unzipping. At the time we implemented it, AWS Lambda compute resources were limited to 3 GB of RAM and this was breaking our processing, so we moved to AWS ECS + Fargate to do the whole job. This was superbly easy: all I had to do was take the code the devs wrote and add the few lines of logic required to wrap SQS (poll, change visibility, delete messages), which AWS Lambda had been doing for us.

We have our ECS service scale based on the depth of messages in the queue. The only real downside of that solution vs. native Lambda (if you set aside managing Docker images in ECR instead of just code in Lambda) is the roughly one-minute minimum it takes to go from 0 containers to N (where N is however many containers you want), given that AWS SQS metrics have one-minute granularity. But our use case is ETL, so waiting a minute before crunching files worth millions of records was not a significant issue.
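For reference, a rough sketch of how that queue-depth scaling could be wired up with Application Auto Scaling target tracking; the cluster, service, and queue names below are placeholders, not our actual ones:

```python
# Hypothetical sketch: scale an ECS service on SQS queue depth using
# Application Auto Scaling target tracking. Resource and queue names are
# placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")

RESOURCE_ID = "service/etl-cluster/unzip-worker"  # placeholder ECS service
QUEUE_NAME = "unzip-jobs"                         # placeholder SQS queue

# Register the ECS service's DesiredCount as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=0,
    MaxCapacity=20,
)

# Target tracking on the number of visible messages in the queue.
autoscaling.put_scaling_policy(
    PolicyName="scale-on-queue-depth",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # try to keep ~100 visible messages
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Namespace": "AWS/SQS",
            "Dimensions": [{"Name": "QueueName", "Value": QUEUE_NAME}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 60,
        "ScaleOutCooldown": 60,
    },
)
```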

With the newer S3 notification features, I would recommend you figure out whether Lambda or ECS is the right place to run each job based on the file attributes you get in the event payload (e.g. file size). Either way, I can't recommend enough using SQS to keep track of the processing jobs for these files rather than SNS, for which you have to implement retry/replay yourself.
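As an illustration of that routing idea, a small dispatcher Lambda could read the object size from the S3 notification payload and either unzip inline or push the job onto SQS for the ECS workers. The queue URL, size threshold, and process_inline() helper below are hypothetical:

```python
# Hypothetical dispatcher: small archives are handled inline, larger ones
# are handed off to SQS for the ECS/Fargate workers.
import json
import boto3

sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/unzip-jobs"  # placeholder
SIZE_THRESHOLD = 200 * 1024 * 1024  # 200 MB; tune to your Lambda memory/tmp limits


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]

        if size <= SIZE_THRESHOLD:
            process_inline(bucket, key)          # small zip: handle it here
        else:
            sqs.send_message(                    # big zip: hand off to ECS workers
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"bucket": bucket, "key": key, "size": size}),
            )


def process_inline(bucket, key):
    """Placeholder for the same unzip logic the ECS workers run."""
    ...
```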

Keep the code able to run in either place (simply by invoking it in ECS the same way you would in Lambda), and time will tell you which is best for your use case.

For my devs who had written the ETL part of the code, all I had to do was reuse the code from this repo that deals with SQS and invoke their lambda_handler(event, context) function with the SQS message payload; that was all.
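A minimal sketch of the kind of wrapper described above, assuming a placeholder queue URL and handler module:

```python
# Minimal SQS wrapper sketch: poll the queue, hand each message to the
# existing lambda_handler(event, context), and delete the message only on
# success so a failure lets it reappear after the visibility timeout.
import boto3

from etl_code import lambda_handler  # placeholder module: the devs' unchanged Lambda entry point

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/unzip-jobs"  # placeholder


def main():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,        # long polling
            VisibilityTimeout=900,     # hide the message while we work on it
        )
        for msg in resp.get("Messages", []):
            # Rebuild a minimal version of the event shape Lambda's SQS
            # integration would normally provide.
            event = {"Records": [{"body": msg["Body"], "receiptHandle": msg["ReceiptHandle"]}]}
            try:
                lambda_handler(event, context=None)
            except Exception:
                continue  # leave the message; it becomes visible again for a retry
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    main()
```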

answered 2 years ago

I think it's more efficient to save it to S3 directly, but I would really like to hear your pros for EFS before going back to S3. Even if the files are big and time-consuming to extract, I don't think EFS is needed as long as you have enough space in Lambda. Another solution for a longer and more memory-consuming process is ECS with the Fargate launch type.
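To illustrate the Lambda-straight-to-S3 path, here is a minimal sketch assuming the archive fits comfortably in Lambda memory; bucket names and the target prefix layout are placeholders:

```python
# Hypothetical sketch: read the zip from the source bucket, extract it in
# memory, and write each member to the target bucket under a new prefix.
import io
import os
import zipfile

import boto3

s3 = boto3.client("s3")

TARGET_BUCKET = "my-unzipped-bucket"  # placeholder


def lambda_handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = record["s3"]["object"]["key"]

        # Pull the whole zip into memory (fine for ~50 MB archives).
        body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()

        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            for name in archive.namelist():
                if name.endswith("/"):
                    continue  # skip directory entries
                # Rearrange into whatever folder structure the target needs.
                base = os.path.splitext(os.path.basename(src_key))[0]
                target_key = f"extracted/{base}/{name}"
                s3.put_object(
                    Bucket=TARGET_BUCKET,
                    Key=target_key,
                    Body=archive.read(name),
                )
```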

MG
answered 2 years ago
  • We are dealing with zip files of around 50 MB each. The extracted folder may have approximately 1,000 PDF and XML files of around 30 KB each. As soon as the unzipped files are processed and moved to a different S3 bucket, we need to delete them from the source S3 bucket. There will be thousands of zip files to process daily.

    I was looking at the following considerations:

    1. Access speed of EFS vs. S3 from Lambda. I know EFS access is faster than S3.
    2. S3 charges per request (PUT/GET), whereas EFS does not charge per request.
    3. The unzipped files are only temporary storage, since we would be moving them to the target S3 bucket within a few minutes anyway.

    Let me know if these inputs help.

  • I see. If it's only temporary storage, then EFS sounds better, as you said (a rough sketch of that flow follows below). I thought the S3 used in Lambda would be the final destination.
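For completeness, a rough sketch of the EFS-as-scratch flow discussed in the comments, assuming the Lambda function has an EFS access point mounted; the mount path, bucket name, and prefix are placeholders:

```python
# Hypothetical sketch: extract onto EFS scratch space, upload the results to
# the target bucket, then delete the source zip and clean up the scratch area.
import os
import shutil
import zipfile

import boto3

s3 = boto3.client("s3")

EFS_MOUNT = "/mnt/scratch"            # placeholder: Lambda EFS mount path
TARGET_BUCKET = "my-unzipped-bucket"  # placeholder


def lambda_handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = record["s3"]["object"]["key"]

        base = os.path.splitext(os.path.basename(src_key))[0]
        work_dir = os.path.join(EFS_MOUNT, base)
        zip_path = work_dir + ".zip"
        os.makedirs(work_dir, exist_ok=True)

        try:
            # Download the archive onto EFS and extract it there.
            s3.download_file(src_bucket, src_key, zip_path)
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(work_dir)

            # Upload every extracted file into the target folder structure.
            for root, _, files in os.walk(work_dir):
                for name in files:
                    path = os.path.join(root, name)
                    rel = os.path.relpath(path, work_dir)
                    s3.upload_file(path, TARGET_BUCKET, f"extracted/{base}/{rel}")

            # Done: remove the source zip from the ingest bucket.
            s3.delete_object(Bucket=src_bucket, Key=src_key)
        finally:
            # Always clear the scratch space on EFS.
            shutil.rmtree(work_dir, ignore_errors=True)
            if os.path.exists(zip_path):
                os.remove(zip_path)
```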
