Options to accelerate an S3 copy of 4 TB worth of files between S3 buckets in the same region


A customer has an S3 bucket containing 12.2 million files that total 4 TB of data. Most of the files are smaller than a few MB, and all of them are in one folder. The customer has to move these files from one bucket to another. The last time they tried, the transfer took days. They are looking for ways to reduce the copy time.

One recommendation could be to batch and compress the files (using tar, zip, etc.) before transfer. In that case, is there an approximate ideal archive size we can recommend?

Also, are there any other or additional solutions we can recommend to reduce the time to transfer data between buckets in the same region?

AWS
Asked 4 years ago · 795 views
2 Answers
Accepted Answer

Check out: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html

Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon EMR cluster. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3. S3DistCp is more scalable and efficient for copying large numbers of objects in parallel across buckets and across AWS accounts.
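
As a rough illustration of the bucket-to-bucket case, here is a minimal Python sketch that builds an EMR step running s3-dist-cp through command-runner.jar and, optionally, submits it with boto3. It assumes an EMR cluster is already running; the bucket names, cluster ID, region, and the helper names build_s3distcp_step and submit_step are hypothetical placeholders, not values from the question.

```python
def build_s3distcp_step(src, dest, name="Copy bucket with S3DistCp"):
    """Return an EMR step definition that runs s3-dist-cp via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # --src and --dest are the source and destination locations.
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Submit the step to an existing cluster (region is an assumed default)."""
    import boto3  # imported here so building a step does not require boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Hypothetical bucket names, for illustration only.
step = build_s3distcp_step("s3://source-bucket/", "s3://destination-bucket/")
```

Calling submit_step("j-XXXXXXXXXXXXX", step) with a real cluster ID would queue the copy; the cluster's instance role needs read access to the source bucket and write access to the destination.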

AWS
Expert
mhjwork
Answered 4 years ago

You can parallelize the copy with the AWS CLI by running several aws s3 sync processes at once, each restricted to part of the data with --exclude and --include. For example, if all your files are in one folder, you might split them by the first letter of the file name, or by some other scheme that you know will divide them into roughly even parts.

Performance is better when the "prefixes" of the URLs are well distributed, so it will be better if all files are in the "top folder", not in some subfolder.

I used this method to transfer 280,000 images for someone and recall that each aws s3 sync process needed about 1/3 of a CPU, so I used a 4-core server to run around 10 processes in parallel.
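
The approach above could be scripted roughly as follows. This is only a sketch: it assumes the AWS CLI is installed and configured, that file names start with a lowercase letter or a digit, and the bucket names and the helper names build_sync_commands and run_parallel are hypothetical.

```python
import string
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumed partitioning scheme: one sync process per leading character.
DEFAULT_FIRST_CHARS = string.ascii_lowercase + string.digits


def build_sync_commands(src, dst, first_chars=DEFAULT_FIRST_CHARS):
    """Build one `aws s3 sync` command per leading character.

    `--exclude "*"` drops everything, then `--include "<c>*"` re-adds only
    the keys that start with that character (later filters take precedence).
    """
    return [
        ["aws", "s3", "sync", src, dst, "--exclude", "*", "--include", f"{c}*"]
        for c in first_chars
    ]


def run_parallel(commands, max_workers=10):
    """Run the commands concurrently and return their exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

For example, run_parallel(build_sync_commands("s3://source-bucket/", "s3://destination-bucket/")) would start up to 10 sync processes at a time. Keys that begin with any other character are not covered by this scheme, so a final unrestricted aws s3 sync afterwards would pick up whatever is left.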

Answered 2 years ago
