Download a file from the internet to S3, and then unzip/untar the file on S3 from a Jupyter Notebook


I would like to download a file from the internet directly to S3. Then I would like to unzip/untar the file and extract its contents (and folder structure) in S3. Note: there is a folder structure within the tar. I am not planning to do this on multiple tar.gz files; it is just a one-time operation as part of a demo in a Jupyter Notebook. What is the simplest, most direct, and most efficient way to accomplish this task?

asked 10 months ago · 689 views
2 Answers

Hi - Some steps could be:

  1. Read the archive (zip or tar) from S3 using the Boto3 S3 resource Object
  2. Open the object using a module which supports working with tar or zip.
  3. Iterate over each file in the archive using any available list method
  4. Write each file back to another bucket in S3
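
The steps above can be sketched roughly as follows. This is a minimal sketch and not a tested recipe: it assumes a Boto3 S3 client (rather than the resource Object mentioned in step 1), an archive small enough to hold in memory, and hypothetical function names, bucket names and keys. `download_url_to_s3` is an assumed extra helper for the "download directly to S3" part of the question.

```python
import io
import tarfile
import urllib.request


def download_url_to_s3(s3_client, url, bucket, key):
    """Stream a file from a URL into S3 without saving it locally (assumed helper)."""
    with urllib.request.urlopen(url) as response:
        # upload_fileobj accepts any binary file-like object that has read()
        s3_client.upload_fileobj(response, bucket, key)


def extract_tar_to_s3(s3_client, src_bucket, src_key, dest_bucket, dest_prefix=""):
    """Read a tar/tar.gz object from S3 and upload each file it contains."""
    # Step 1: read the archive from S3 (held fully in memory for simplicity)
    body = s3_client.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()
    uploaded = []
    # Step 2: open it with tarfile; "r:*" auto-detects the compression
    with tarfile.open(fileobj=io.BytesIO(body), mode="r:*") as archive:
        # Step 3: iterate over the members of the archive
        for member in archive:
            if not member.isfile():
                continue  # S3 has no real folders; the key carries the path
            key = dest_prefix + member.name
            # Step 4: write each file back to S3 under its original path
            s3_client.upload_fileobj(archive.extractfile(member), dest_bucket, key)
            uploaded.append(key)
    return uploaded
```

In a notebook this might be called with `s3 = boto3.client("s3")` and placeholder names such as `extract_tar_to_s3(s3, "my-bucket", "raw/data.tar.gz", "my-bucket", "extracted/")`. Because each object key includes the member's path inside the tar, the folder structure shows up as key prefixes in S3. For a zip file, the `zipfile` module could be swapped in for `tarfile` along the same lines.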
AWS
answered 10 months ago

The suggestion by @Nitin above would certainly work. If preserving the directory tree within the archive is important, you may want to look at mounting the S3 bucket onto the Linux host itself.

The officially supported way would be S3 File Gateway, but that's expensive and probably not worth it for a one-off demonstration.

There is also s3fs, which will do much the same. I find it rather slow, but if it's just for a one-off demonstration you can probably live with it. The GitHub project page shows where it's available from and how to install it.

There's also a very new offering called Mountpoint for S3, which I've not used myself yet, but on a quick reading of that blog it may also achieve what you want.
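
Whichever mounting option is used, extraction then becomes ordinary file I/O against the mount point. The following is only a sketch under assumptions: the function name and paths are hypothetical, the bucket is assumed to be already mounted and writable, and how each tool handles writes or file metadata has not been verified here. It copies file contents only and skips permissions and timestamps, since a mounted bucket may not support them.

```python
import os
import shutil
import tarfile


def extract_to_mount(archive_path, mount_dir):
    """Extract the regular files of a tar/tar.gz into a mounted directory."""
    written = []
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # folders are created on demand below
            # Note: member names are trusted here; only use this on archives you trust
            target = os.path.join(mount_dir, member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Copy bytes only; no chmod or utime calls are made
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, with a bucket mounted at a placeholder path such as `/mnt/my-bucket`, `extract_to_mount("data.tar.gz", "/mnt/my-bucket/extracted")` would recreate the tar's folder structure under that prefix.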

answered 10 months ago
