The aws s3 sync command is the best option. It supports highly parallelised copying of all sizes of objects, and it'll perform well with 18 million objects, which isn't a large number for S3. It also supports deleting objects in the destination bucket that don't exist in the source bucket, with the --delete parameter.
You should configure the CLI to use a sufficiently high degree of parallelism for the copy operation and to queue the maximum supported number of objects/chunks (10,000) at a time, to minimise delays in feeding the copy pipeline with the next tasks. You could start with the options listed below, then decrease or increase the max_concurrent_requests value if your client machine gets overwhelmed coordinating the copy or if it doesn't seem to push hard enough. The heavy lifting of transferring the data happens internally between S3's fleet of servers in the AWS region.
aws configure set default.s3.max_concurrent_requests 256
aws configure set default.s3.max_queue_size 10000
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.payload_signing_enabled false
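With those settings in place, the copy itself is a single command along the lines of the following (the bucket names are placeholders; the --delete flag removes destination objects that no longer exist in the source, as described above):
aws s3 sync s3://source-bucket s3://destination-bucket --delete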
It's worth noting that, by default, the sync command only considers the object keys and sizes when deciding whether an object needs to be copied or already exists in the destination. If identically named and sized objects could exist in both buckets while differing in content, it would be safer to completely empty the target bucket first and only then sync the data from the source bucket. This can also be faster than having the sync command compare listings of millions of objects, which has to be done on the client side.
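If you do decide to empty the destination first, that can also be done with the CLI; a minimal sketch, assuming the destination bucket is named destination-bucket and is not versioned:
aws s3 rm s3://destination-bucket --recursive
followed by the sync command shown above.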
Use S3 replication.
See docs here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
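For completeness, setting up live replication means enabling versioning on both buckets and attaching a replication configuration, roughly along these lines (the bucket names are placeholders, and replication.json would need to contain the IAM role and destination bucket ARN as described in the linked documentation):
aws s3api put-bucket-versioning --bucket source-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket destination-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json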
Hope this helps!
S3's replication only replicates changes, not existing objects. Batch Operations can process existing objects for an initial copy, but it has serious limitations, such as not supporting copying objects larger than 5 GB at all.
I forgot to mention, but my buckets do not use versioning, and it seems that versioning has to be turned on for replication.
Replication won't work for your use case anyway, @Moti, because it wouldn't copy existing objects.
There are two types of replication: live replication and on-demand replication.
Live replication – To automatically replicate new and updated objects as they are written to the source bucket, use live replication. Live replication doesn't replicate any objects that existed in the bucket before you set up replication. To replicate objects that existed before you set up replication, use on-demand replication.
On-demand replication – To replicate existing objects from the source bucket to one or more destination buckets on demand, use S3 Batch Replication. For more information about replicating existing objects, see When to use S3 Batch Replication.
That is the S3 Batch Operations copy alternative I mentioned (which isn't actual replication, just a batch copy operation), but it doesn't meet the requirements stated in the question. A batch copy doesn't produce the specified exact replica of the source when the destination contains objects that no longer exist at the source, for example if the data was copied earlier and some objects have since been deleted from the source. Beyond what was stated in the question, it also doesn't support copying objects larger than 5 GB. The CLI's sync feature has neither limitation.