Downloading millions of files


Hi, I have inherited a project that uses S3 storage (in fact, I'm told it was one of the first users of S3). The project is now at the end of its life (i.e. out of money) and we need to archive the data.

The problem is that all the data is stored in a single bucket, under a single folder.

This folder contains 51,198,860 files totalling approximately 430 GB. We wrote a PHP script to download the data; however, at the current rate it will take an estimated 3 months to finish the copy.

Any help? Ideas?

asked a year ago · 495 views
3 Answers

See this blog post: https://aws.amazon.com/blogs/storage/cross-account-bulk-transfer-of-files-using-amazon-s3-batch-operations/

Similar to the blog, my recommendation would be to use S3 Inventory to get a list of the files in the bucket, then do some scripting (on an EC2 instance close to the S3 data) to bundle files from the inventory list into zips in another bucket -- the goal is to create far fewer but larger (perhaps 1 GB) files. Once you have fewer but larger files, proceed with the download; see the sketch below. This should help utilize your bandwidth for meaningful transfers rather than millions of connects/disconnects.
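As a rough illustration only (not the exact approach from the blog post), here is a hedged bash + AWS CLI sketch of that batching idea. It assumes you have already enabled S3 Inventory and flattened its report into a local file inventory-keys.txt with one object key per line; the bucket names, file names, and the 50,000-keys-per-archive batch size are placeholders you would tune to land near 1 GB per archive.

# Placeholders -- substitute your own bucket names.
SRC_BUCKET="source-bucket-name"
DEST_BUCKET="archive-staging-bucket"
BATCH_SIZE=50000   # ~50k keys x ~8 KiB each is roughly 400 MB per archive; adjust as needed

# Split the inventory-derived key list into fixed-size batches (batch_aa, batch_ab, ...).
split -l "$BATCH_SIZE" inventory-keys.txt batch_

n=0
for batch in batch_*; do
    workdir="work_$n"
    mkdir -p "$workdir"

    # Download this batch of keys into a working directory.
    while read -r key; do
        aws s3 cp "s3://$SRC_BUCKET/$key" "$workdir/$key"
    done < "$batch"

    # Pack the batch into a single archive and push it to the staging bucket.
    tar -czf "archive_$n.tar.gz" -C "$workdir" .
    aws s3 cp "archive_$n.tar.gz" "s3://$DEST_BUCKET/archive_$n.tar.gz"

    rm -rf "$workdir" "archive_$n.tar.gz"
    n=$((n + 1))
done

In practice you would parallelise the per-key downloads (for example with xargs -P) rather than copy them one at a time as shown; the sketch is only meant to show the batch-then-archive structure.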

Hope this helps.

answered a year ago
  • I agree: zipping multiple files into bigger archives will speed things up.

Accepted Answer

The number of files isn't astronomically large, but it's certainly huge for a single-threaded PHP script to process one file at a time. You can speed the copy up by a couple of orders of magnitude by using the AWS Command Line Interface (AWS CLI) instead. Since your files average only about 8 KiB in size, you could set max_concurrent_requests to 128 to start with, for roughly a hundred-fold performance increase, and optionally experiment with larger values. Also set max_queue_size to 10,000. See https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html
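As a minimal example of applying those two settings to the default profile (the values are starting points, not tuned recommendations):

aws configure set default.s3.max_concurrent_requests 128
aws configure set default.s3.max_queue_size 10000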

Then download the files like so:

aws s3 sync s3://my-bucket-name /local-destination

You can do this on an EC2 instance in the same region as the bucket, as iBehr correctly suggested, but even from any other location you'll get massively improved performance compared to the single-threaded copy, without intermediate steps or additional infrastructure, by utilising the built-in parallelisation capability of the AWS CLI/SDK.

If you choose to copy the files to an EC2 instance first, I suggest you make sure you have a VPC gateway endpoint for S3 in the VPC before starting the copy. It'll avoid the added cost of running the transfer traffic through a NAT gateway.
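For illustration, a gateway endpoint for S3 can be created with the AWS CLI roughly like this; the VPC ID, route table ID, and region below are placeholders for your own values:

aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0

The gateway endpoint itself has no charge; associating it with the route tables used by the instance's subnets sends S3 traffic through the endpoint instead of the NAT gateway.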

answered a year ago

Thank you! This worked great.

answered 10 months ago
