I want to transfer a large amount of data (1 TB or more) from one Amazon Simple Storage Service (Amazon S3) bucket to another bucket.
Depending on your use case, you can perform the data transfer between buckets using one of the following options:
- Run parallel uploads using the AWS Command Line Interface (AWS CLI)
- Use an AWS SDK
- Use cross-Region replication or same-Region replication
- Use Amazon S3 batch operations
- Use S3DistCp with Amazon EMR
- Use AWS DataSync
Run parallel uploads using the AWS CLI
Note: As a best practice, make sure that you're using the most recent version of the AWS CLI. For more information, see Installing or updating the latest version of the AWS CLI.
To shorten the transfer time, you can split the transfer into multiple mutually exclusive operations that run in parallel. For example, you can run multiple, parallel instances of aws s3 cp, aws s3 mv, or aws s3 sync using the AWS CLI. Use the --exclude and --include parameters so that each instance of the AWS CLI handles a different subset of the objects. These parameters filter operations by file name.
Note: The --exclude and --include parameters are processed on the client side. Therefore, the resources of your local machine might affect the performance of the operation.
For example, to copy a large amount of data from one bucket to another (where file names begin with a number), run the following commands.
First, run this command to copy the files with names that begin with the numbers 0 through 4:
aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
Then, run this command on a second AWS CLI instance to copy the files with names that begin with the numbers 5 through 9:
aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
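The two commands above can also be started from a single script so that both run at the same time. The following is a minimal Python sketch, not part of the original procedure. The helper names build_copy_command and run_in_parallel are hypothetical, and the sketch assumes that the AWS CLI is installed and configured on the machine that runs it.

```python
import subprocess
from typing import List, Sequence

SRC = "s3://source-awsexamplebucket/"
DST = "s3://destination-awsexamplebucket/"


def build_copy_command(src: str, dst: str, first_chars: Sequence[str]) -> List[str]:
    """Build an `aws s3 cp` argument list limited to names that start with first_chars."""
    cmd = ["aws", "s3", "cp", src, dst, "--recursive", "--exclude", "*"]
    for ch in first_chars:
        # Each --include re-admits one slice of the names excluded by "*".
        cmd += ["--include", f"{ch}*"]
    return cmd


def run_in_parallel(commands: Sequence[List[str]]) -> List[int]:
    """Start every command at once, wait for all of them, and return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Two mutually exclusive slices, matching the two commands above.
commands = [
    build_copy_command(SRC, DST, "01234"),
    build_copy_command(SRC, DST, "56789"),
]

# Uncomment to start the transfer (this copies real data):
# exit_codes = run_in_parallel(commands)
```

Because the argument lists are passed to the AWS CLI without a shell, the "*" patterns reach the CLI unexpanded. More slices (for example, one per leading character) can be added to the list if the machine has spare resources.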
Additionally, you can customize the following AWS CLI configurations to speed up the data transfer:
- multipart_chunksize: This value sets the size of each part that the AWS CLI uploads in a multipart upload for an individual file. This setting allows you to break down a larger file (for example, 300 MB) into smaller parts for quicker upload speeds.
Note: A multipart upload requires that a single file is uploaded in no more than 10,000 distinct parts. Make sure that the chunk size that you set balances the size of each part against the number of parts.
- max_concurrent_requests: This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10. You can increase it to a higher value. Make sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
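The balance between part size and part count can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, sizes are treated as binary units (MiB), and the 5 MiB minimum part size is the Amazon S3 multipart limit rather than something stated above.

```python
MIB = 1024 ** 2
MAX_PARTS = 10_000          # Amazon S3 limit on parts per multipart upload
MIN_PART_BYTES = 5 * MIB    # Amazon S3 minimum part size (the last part may be smaller)


def part_count(object_bytes: int, chunk_bytes: int) -> int:
    """Number of parts needed to upload object_bytes in chunks of chunk_bytes."""
    return -(-object_bytes // chunk_bytes)  # integer ceiling division


def min_chunksize(object_bytes: int) -> int:
    """Smallest chunk size (in bytes) that keeps the upload within MAX_PARTS."""
    return max(MIN_PART_BYTES, -(-object_bytes // MAX_PARTS))


print(part_count(300 * MIB, 8 * MIB))  # 300 MiB file in 8 MiB chunks -> 38
print(min_chunksize(1024 ** 4))        # 1 TiB object -> 109951163 bytes (about 105 MiB)
```

After you choose values, you could write them to the AWS CLI configuration, for example with aws configure set default.s3.multipart_chunksize 128MB and aws configure set default.s3.max_concurrent_requests 20. Both values are illustrative, not recommendations.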
Use an AWS SDK
Consider building a custom application using an AWS SDK to perform the data transfer for a very large number of objects. The AWS CLI can also perform the copy operation, but a custom application might be more efficient when you transfer hundreds of millions of objects.
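As a rough illustration of what such an application might look like, the following Python sketch lists the source bucket and copies each object on a thread pool. It assumes that the s3 argument is an AWS SDK for Python (Boto3) S3 client, created elsewhere with boto3.client("s3"). The function names, the bucket names in the usage note, and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Iterator


def iter_keys(s3: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under prefix, following pagination."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def copy_bucket(s3: Any, src_bucket: str, dst_bucket: str,
                prefix: str = "", workers: int = 16) -> int:
    """Copy all objects under prefix to dst_bucket and return how many were copied."""

    def copy_one(key: str) -> str:
        # The client's managed copy switches to multipart copy for large objects.
        s3.copy({"Bucket": src_bucket, "Key": key}, dst_bucket, key)
        return key

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: map() queues every key up front. For hundreds of millions
        # of objects, a real application would feed the pool in batches.
        return sum(1 for _ in pool.map(copy_one, iter_keys(s3, src_bucket, prefix)))
```

A call might look like copy_bucket(boto3.client("s3"), "source-awsexamplebucket", "destination-awsexamplebucket", workers=32). Error handling and retries for individual keys are left out to keep the sketch short.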
Use cross-Region replication or same-Region replication
After you set up cross-Region replication (CRR) or same-Region replication (SRR) on the source bucket, Amazon S3 automatically replicates new objects from the source bucket to the destination bucket. You can choose to filter which objects are replicated using a prefix or tag. For more information on configuring replication and specifying a filter, see Replication configuration overview.
After replication is configured, only new objects are replicated to the destination bucket. Existing objects aren't replicated automatically. To replicate existing objects, see Replicating existing objects with S3 Batch Replication.
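To make the shape of a replication rule with a prefix filter concrete, the sketch below builds a configuration and applies it through an SDK client. It assumes that s3 is a Boto3 S3 client, that versioning is already enabled on both buckets, and that the IAM role exists. The rule ID, function names, and any ARNs you pass in are hypothetical placeholders.

```python
from typing import Any, Dict


def build_replication_config(role_arn: str, dest_bucket: str,
                             prefix: str = "") -> Dict[str, Any]:
    """Build a one-rule replication configuration filtered by key prefix."""
    return {
        "Role": role_arn,  # IAM role that Amazon S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-by-prefix",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


def enable_replication(s3: Any, source_bucket: str, config: Dict[str, Any]) -> None:
    """Attach the replication configuration to the source bucket."""
    s3.put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)
```

The same rule works for CRR and SRR; which one applies depends only on whether the destination bucket is in a different Region from the source bucket.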
Use Amazon S3 batch operations
You can use Amazon S3 batch operations to copy multiple objects with a single request. When you create a batch operation job, you specify which objects to perform the operation on using an Amazon S3 inventory report. Or, you can use a CSV manifest file to specify a batch job. Then, Amazon S3 batch operations call the API to perform the operation.
After the batch operation job is complete, you get a notification and you can choose to receive a completion report about the job.
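If you use a CSV manifest, each row names one object as a bucket and key pair. The sketch below shows one way such a manifest could be generated. It assumes that keys must be URL-encoded and that "/" may stay unencoded, so check the manifest format in the S3 Batch Operations documentation before relying on it. The function name and the example keys are hypothetical.

```python
import csv
import io
from typing import Iterable
from urllib.parse import quote


def build_manifest(bucket: str, keys: Iterable[str]) -> str:
    """Return CSV manifest text with one `bucket,key` row per object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        # Assumption: keys are percent-encoded, with "/" left as-is.
        writer.writerow([bucket, quote(key)])
    return buffer.getvalue()


manifest = build_manifest("source-awsexamplebucket", ["0/a.txt", "my file.csv"])
print(manifest, end="")
```

The resulting text would be uploaded to a bucket and referenced when the job is created. Creating the job itself is not shown here.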
Use S3DistCp with Amazon EMR
The S3DistCp operation on Amazon EMR can perform parallel copying of large volumes of objects across Amazon S3 buckets. S3DistCp first copies the files from the source bucket to the worker nodes in an Amazon EMR cluster. Then, the operation writes the files from the worker nodes to the destination bucket. For more guidance on using S3DistCp, see Seven tips for using S3DistCp on Amazon EMR to move data efficiently between HDFS and Amazon S3.
Important: Because this option requires that you use Amazon EMR, be sure to review Amazon EMR pricing.
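One common way to run S3DistCp is to add it as a step on a running cluster. The following sketch builds such a step and submits it. It assumes that emr is a Boto3 EMR client and that the cluster already exists. The step name, function names, and cluster ID are hypothetical placeholders.

```python
from typing import Any, Dict


def build_s3distcp_step(src: str, dest: str) -> Dict[str, Any]:
    """Describe an EMR step that runs s3-dist-cp from src to dest."""
    return {
        "Name": "S3DistCp bucket-to-bucket copy",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar runs a command on the cluster's primary node.
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


def submit_step(emr: Any, cluster_id: str, step: Dict[str, Any]) -> str:
    """Add the step to the cluster and return the new step ID."""
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_s3distcp_step(
    "s3://source-awsexamplebucket/",
    "s3://destination-awsexamplebucket/",
)
```

The step could then be submitted with submit_step(boto3.client("emr"), "j-EXAMPLECLUSTERID", step), where the cluster ID is a placeholder for your own.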
Use AWS DataSync
To move large amounts of data from one Amazon S3 bucket to another bucket, perform the following steps:
1. Open the AWS DataSync console.
2. Create a task.
3. Create a new location for Amazon S3.
4. Select your S3 bucket as the source location.
5. Update the source location configuration settings. Make sure to specify the AWS Identity and Access Management (IAM) role that will be used to access your source S3 bucket.
6. Select your S3 bucket as the destination location.
7. Update the destination location configuration settings. Make sure to specify the AWS Identity and Access Management (IAM) role that will be used to access your destination S3 bucket.
8. Configure settings for your task.
9. Review the configuration details.
10. Choose Create task.
11. Start your task.
Important: When you use AWS DataSync, you will incur additional costs. To preview any DataSync costs, review the DataSync pricing structure and DataSync limits.
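The console steps above can also be approximated with an AWS SDK. The sketch below creates the two Amazon S3 locations, creates a task, and starts it. It assumes that datasync is a Boto3 DataSync client and that both IAM roles already grant DataSync access to the buckets. The function names, task name, and any ARNs you pass in are hypothetical.

```python
from typing import Any


def create_s3_location(datasync: Any, bucket: str, role_arn: str) -> str:
    """Register an S3 bucket as a DataSync location and return the location ARN."""
    response = datasync.create_location_s3(
        S3BucketArn=f"arn:aws:s3:::{bucket}",
        S3Config={"BucketAccessRoleArn": role_arn},
    )
    return response["LocationArn"]


def create_copy_task(datasync: Any, source_bucket: str, source_role_arn: str,
                     dest_bucket: str, dest_role_arn: str,
                     name: str = "s3-to-s3-copy") -> str:
    """Create the source and destination locations, then the task that links them."""
    source_arn = create_s3_location(datasync, source_bucket, source_role_arn)
    dest_arn = create_s3_location(datasync, dest_bucket, dest_role_arn)
    task = datasync.create_task(
        SourceLocationArn=source_arn,
        DestinationLocationArn=dest_arn,
        Name=name,  # hypothetical task name
    )
    return task["TaskArn"]


def start_task(datasync: Any, task_arn: str) -> str:
    """Start the task and return the ARN of the new task execution."""
    return datasync.start_task_execution(TaskArn=task_arn)["TaskExecutionArn"]
```

Task settings such as verification mode or bandwidth limits are left at their defaults in this sketch. They correspond to the "Configure settings for your task" step in the console.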