How can I optimize performance when I upload large amounts of data to Amazon S3?

I'm uploading a large amount of data to Amazon Simple Storage Service (Amazon S3), or copying a large amount of data between S3 buckets. How can I optimize the performance of this data transfer?

Resolution

Consider the following methods of transferring large amounts of data to or from Amazon S3 buckets:

Parallel uploads using the AWS Command Line Interface (AWS CLI)

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.

To potentially decrease the overall time that it takes to complete the transfer, split the transfer into multiple mutually exclusive operations. You can run multiple instances of aws s3 cp (copy), aws s3 mv (move), or aws s3 sync (synchronize) at the same time.

One way to split up your transfer is to use the --exclude and --include parameters to separate the operations by file name. For example, suppose that you want to copy a large amount of data from one bucket to another, and all of the file names begin with a number. You can run the following commands on two instances of the AWS CLI.

Note: The --exclude and --include parameters are processed on the client side. Because of this, the resources of your local machine might affect the performance of the operation.

Run this command to copy the files with names that begin with the numbers 0 through 4:

aws s3 cp s3://srcbucket/ s3://destbucket/ --recursive --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"

Run this command to copy the files with names that begin with the numbers 5 through 9:

aws s3 cp s3://srcbucket/ s3://destbucket/ --recursive --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"

Important: If you must transfer a large number of objects (hundreds of millions), consider building a custom application using an AWS SDK to perform the copy. While the AWS CLI can perform the copy, a custom application might be more efficient at that scale.
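As a rough sketch of what such an application might look like, the following Python example uses the AWS SDK for Python (Boto3) to list the source bucket and issue server-side copy requests from a thread pool. The bucket names and worker count are placeholder values, and a production application at this scale would also need retries, error handling, and progress tracking:

import boto3
from concurrent.futures import ThreadPoolExecutor

# Placeholder bucket names; replace with your own.
SRC_BUCKET = "srcbucket"
DEST_BUCKET = "destbucket"

s3 = boto3.client("s3")

def copy_object(key):
    # copy_object performs a single-request server-side copy, which is
    # limited to objects up to 5 GB. For larger objects, use the managed
    # transfer method s3.copy, which handles multipart copies.
    s3.copy_object(
        Bucket=DEST_BUCKET,
        Key=key,
        CopySource={"Bucket": SRC_BUCKET, "Key": key},
    )

# Page through every object in the source bucket and copy in parallel.
paginator = s3.get_paginator("list_objects_v2")
with ThreadPoolExecutor(max_workers=32) as executor:
    for page in paginator.paginate(Bucket=SRC_BUCKET):
        for obj in page.get("Contents", []):
            executor.submit(copy_object, obj["Key"])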

AWS Snowball

Consider using AWS Snowball for transfers between your on-premises data centers and Amazon S3, particularly when the data exceeds 10 TB.

Note the following limitations:

  • AWS Snowball doesn't support bucket-to-bucket data transfers.
  • AWS Snowball doesn't support server-side encryption with keys managed by AWS Key Management Service (AWS KMS). For more information, see Encryption in AWS Snowball.

S3DistCp with Amazon EMR

Consider using S3DistCp with Amazon EMR to copy data across Amazon S3 buckets. S3DistCp enables parallel copying of large volumes of objects.

Important: Because this option requires you to launch an Amazon EMR cluster, be sure to review Amazon EMR pricing.
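For reference, here is a minimal sketch of submitting an S3DistCp step to a running EMR cluster with the AWS SDK for Python (Boto3). The cluster ID and bucket names are placeholders:

import boto3

emr = boto3.client("emr")

# Placeholder cluster ID and bucket names; replace with your own.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "S3DistCp bucket-to-bucket copy",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar invokes s3-dist-cp, which is
                # preinstalled on Amazon EMR clusters.
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://srcbucket/",
                    "--dest=s3://destbucket/",
                ],
            },
        }
    ],
)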


Related information

Request rate and performance guidelines
