I'm using the AWS Command Line Interface (AWS CLI) sync command to transfer data on Amazon Simple Storage Service (Amazon S3). However, the transfer is taking a long time to complete. How can I improve the performance of a transfer using the sync command?
Try the following approaches for improving the transfer time when you run the sync command:
Note: The sync command compares the source and destination buckets to determine which source files don't exist in the destination bucket and which source files were modified when compared to the files in the destination bucket. Then, the sync command copies the new or updated source files to the destination bucket. Because of this comparison, the number of objects in the source and destination buckets affects the time that the sync command takes to complete. It's important to understand how transfer size can affect the duration of the sync and the cost that you can incur from requests to Amazon S3.
Running multiple instances of the AWS CLI
To copy a large amount of data, you can run multiple instances of the AWS CLI to perform separate sync operations in parallel. For example, you can run parallel sync operations for different prefixes:
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/folder1 s3://destination-AWSDOC-EXAMPLE-BUCKET/folder1
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/folder2 s3://destination-AWSDOC-EXAMPLE-BUCKET/folder2
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent AWS CLI version.
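The two prefix-based commands above can be launched from a single script. The following is a minimal sketch, not an official AWS tool: the function name sync_prefixes and the AWS_CLI override variable are hypothetical helpers introduced here for illustration, and the bucket names are the placeholders used in this article.

```shell
#!/bin/sh
# Placeholder bucket names taken from the examples in this article.
SRC="s3://source-AWSDOC-EXAMPLE-BUCKET"
DST="s3://destination-AWSDOC-EXAMPLE-BUCKET"

# Hypothetical override: set AWS_CLI="echo aws" to print the commands
# instead of running them.
AWS_CLI="${AWS_CLI:-aws}"

# Hypothetical helper: start one background sync per prefix argument,
# then wait for all of them and report failure if any sync failed.
sync_prefixes() {
  pids=""
  for prefix in "$@"; do
    $AWS_CLI s3 sync "$SRC/$prefix" "$DST/$prefix" &
    pids="$pids $!"
  done
  status=0
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}

# Example call (commented out so that sourcing this file copies nothing):
# sync_prefixes folder1 folder2
```

Each sync runs as its own AWS CLI process, so the machine needs enough CPU, memory, and network capacity for all of them at the same time.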
Or, you can run parallel sync operations for separate exclude and include filters. For example, the following operations separate the files to sync by key names that begin with numbers 0 through 4, and numbers 5 through 9:
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
Note: Even when you use exclude and include filters, the sync command still reviews all files in the source bucket. This review helps to identify which source files are to be copied over to the destination bucket. If you have multiple sync operations that target different key name prefixes, then each sync operation reviews all the source files. However, because of the exclude and include filters, only the files that are included in the filters are copied to the destination bucket.
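The filter-based split can also be scripted so that the --include list is generated rather than typed by hand. This is a sketch under stated assumptions: sync_shard and AWS_CLI are hypothetical names introduced for illustration, and the bucket names are the placeholders from this article. The only AWS CLI options used are --exclude and --include, as in the commands above.

```shell
#!/bin/sh
# Placeholder bucket names taken from the examples in this article.
SRC="s3://source-AWSDOC-EXAMPLE-BUCKET"
DST="s3://destination-AWSDOC-EXAMPLE-BUCKET"

# Hypothetical override: set AWS_CLI="echo aws" to print the command
# instead of running it.
AWS_CLI="${AWS_CLI:-aws}"

# Hypothetical helper: sync only keys that begin with one of the given
# leading strings, for example: sync_shard 0 1 2 3 4
sync_shard() {
  n=$#
  # Append the catch-all exclude after the original arguments, then turn
  # each original argument into an --include pattern and drop it.
  set -- "$@" --exclude '*'
  while [ "$n" -gt 0 ]; do
    set -- "$@" --include "${1}*"
    shift
    n=$((n - 1))
  done
  $AWS_CLI s3 sync "$SRC/" "$DST/" "$@"
}

# Example: run both shards in parallel and wait for them to finish
# (commented out so that sourcing this file copies nothing):
# sync_shard 0 1 2 3 4 &
# sync_shard 5 6 7 8 9 &
# wait
```

The exclude filter comes first because later filters take precedence in the AWS CLI. To check which objects a shard would copy, you can add the --dryrun option to the sync command.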
For more information on optimizing the performance of your workload, see Best practices design patterns: Optimizing Amazon S3 performance.
Modifying the AWS CLI configuration value for max_concurrent_requests
To potentially improve performance, you can modify the value of max_concurrent_requests. This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10, and you can increase it to a higher value. However, note the following:
- Running more threads consumes more resources on your machine. Make sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
- Too many concurrent requests can overwhelm a system, which might cause connection timeouts or slow the responsiveness of the system. To avoid timeout issues from the AWS CLI, you can try setting the --cli-read-timeout value or the --cli-connect-timeout value to 0. A value of 0 means that the socket operation blocks and doesn't time out.
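The settings described above might be applied as in the following sketch. The function name tuned_sync and the AWS_CLI override variable are hypothetical and exist only for illustration. The value 20 in the example call is an arbitrary starting point, not a recommendation. Note that aws configure set changes the default profile persistently.

```shell
#!/bin/sh
# Hypothetical override: set AWS_CLI="echo aws" to print the commands
# instead of running them.
AWS_CLI="${AWS_CLI:-aws}"

# Hypothetical helper.
#   $1 = value for max_concurrent_requests (the default is 10)
#   $2 = source, $3 = destination
tuned_sync() {
  # Persist the concurrency setting in the default profile's s3 section.
  $AWS_CLI configure set default.s3.max_concurrent_requests "$1" &&
  # Run the sync with the read and connect timeouts set to 0.
  $AWS_CLI s3 sync "$2" "$3" --cli-read-timeout 0 --cli-connect-timeout 0
}

# Example call (commented out so that sourcing this file changes nothing):
# tuned_sync 20 s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/
```

Increase the value gradually and watch CPU, memory, and network usage on the machine before raising it further.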
(Optional) Checking the instance configuration
If you're using an Amazon Elastic Compute Cloud (Amazon EC2) instance to run the sync operation, consider the following:
- Review the instance type that you're using. Larger instance types can provide better results because they offer higher network bandwidth and Amazon Elastic Block Store (Amazon EBS)-optimized networking.
- To reduce latency, reduce the geographical distance between the instance and your Amazon S3 bucket. If the instance is in a different AWS Region than the bucket, then use an instance in the same Region instead.
- If the instance is in the same Region as the source bucket, then set up an Amazon Virtual Private Cloud (Amazon VPC) endpoint for S3. VPC endpoints can help improve overall performance.
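For the VPC endpoint item above, a gateway endpoint for S3 might be created from the AWS CLI as sketched below. The function name create_s3_gateway_endpoint and the AWS_CLI override variable are hypothetical, and the Region, VPC ID, and route table ID in the example call are made-up placeholders that you must replace with your own values.

```shell
#!/bin/sh
# Hypothetical override: set AWS_CLI="echo aws" to print the command
# instead of running it.
AWS_CLI="${AWS_CLI:-aws}"

# Hypothetical helper.
#   $1 = AWS Region of the VPC and bucket
#   $2 = VPC ID
#   $3 = ID of the route table used by the instance's subnet
create_s3_gateway_endpoint() {
  $AWS_CLI ec2 create-vpc-endpoint \
    --region "$1" \
    --vpc-id "$2" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.$1.s3" \
    --route-table-ids "$3"
}

# Example call with placeholder IDs (commented out so that sourcing this
# file creates nothing):
# create_s3_gateway_endpoint us-east-1 vpc-EXAMPLE rtb-EXAMPLE
```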
How can I use Data Pipeline to run a one-time copy or automate a scheduled synchronization of my Amazon S3 buckets?