Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon EMR cluster. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for copying large numbers of objects in parallel, across buckets and across AWS accounts.
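S3DistCp runs as a step on an EMR cluster; one common way to submit it is `aws emr add-steps` with `command-runner.jar`. The cluster ID and bucket paths below are placeholders, and the `echo` keeps this a dry-run sketch that only prints the command — remove it to actually submit the step:

```shell
# Sketch: submit an S3DistCp step to an existing EMR cluster.
# j-XXXXXXXXXXXXX and the s3:// / hdfs:// paths are placeholders.
step='Type=CUSTOM_JAR,Name=S3DistCpStep,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://source-bucket/logs/,--dest=hdfs:///input/]'

# Dry run: print the CLI call instead of executing it.
echo aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps "$step"
```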
You can also parallelize the copy with the AWS CLI, running several `aws s3 sync` processes at once and restricting each one with `--exclude` and `--include` filters. For example, if all your files are in one folder, you might split them by their first letter, or by some other scheme that you know will distribute them into roughly even parts.
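The splitting scheme above can be sketched as follows. The bucket names are placeholders, and the script only builds and prints the commands; pipe the output to `bash`, or append ` &` to each command plus a final `wait`, to actually run the syncs concurrently:

```shell
# Hypothetical bucket names for illustration.
SRC="s3://source-bucket"
DST="s3://dest-bucket"

# Build one `aws s3 sync` invocation per leading-character group.
# `--exclude '*'` drops everything, then `--include` re-adds only the
# keys for this group, so each worker handles a disjoint slice.
cmds=()
for prefix in a b c d e f; do
  cmds+=("aws s3 sync $SRC $DST --exclude '*' --include '${prefix}*'")
done

# Print the commands this sketch would run in parallel.
printf '%s\n' "${cmds[@]}"
```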
Performance is better when the key prefixes in the URLs are well distributed, so it helps if the files sit at the top level of the bucket rather than all inside one subfolder.
I used this method to transfer 280,000 images for someone, and I recall that each `aws s3 sync` process needed about 1/3 of a CPU, so I used a 4-core server to run around 10 processes in parallel.