s3-dist-cp changes compression of already compressed input files


Hi,

I am trying to copy a large number of .gz files between two S3 buckets using the s3-dist-cp command, submitted as a step on an EMR cluster. The copy succeeds, but I noticed a size discrepancy at the destination: the copied files are slightly smaller. I downloaded a small sample to my laptop to compare against the source. When uncompressed, the files are identical, but gzip -l reports a compression ratio of 65.8% for the source file and 66.2% for the destination file. So it seems that s3-dist-cp recompresses the files. Is it possible to just copy the files as-is? This article mentions that --outputCodec=keep is supposed to do just that, but the docs list no keep option for the --outputCodec argument. Instead, the docs state:

If you do not specify a value for ‑‑outputCodec, the files are copied over with no change in their compression.

This does not appear to be true. I tried all of the invocations below, and none of them copied the files as-is:

s3-dist-cp --src=s3a://source-bucket/ --dest=s3a://dest-bucket/ --copyFromManifest --previousManifest=s3://dest-bucket/manifest.gz

s3-dist-cp --src=s3a://source-bucket/ --dest=s3a://dest-bucket/ --copyFromManifest --previousManifest=s3://dest-bucket/manifest.gz --outputCodec=keep

s3-dist-cp --src=s3a://source-bucket/ --dest=s3a://dest-bucket/ --copyFromManifest --previousManifest=s3://dest-bucket/manifest.gz --outputCodec=none

(This last invocation actually decompressed the files at the destination.)

In contrast, when copying .gz.parquet files there is no size discrepancy. So, how can I achieve the same result for .gz files?
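For anyone who wants to reproduce the comparison described above (identical decompressed content, different compressed size), here is a minimal Python sketch. It assumes one source object and one destination object have already been downloaded locally. The function names and the file names in the usage note are hypothetical, and this is only an illustration of the check, not the exact procedure I ran.

```python
import gzip
import hashlib
import os


def describe_gzip(path):
    """Return the compressed size, uncompressed size, and SHA-256 of the decompressed bytes."""
    digest = hashlib.sha256()
    uncompressed = 0
    with gzip.open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
            uncompressed += len(chunk)
    return {
        "compressed_bytes": os.path.getsize(path),
        "uncompressed_bytes": uncompressed,
        "sha256": digest.hexdigest(),
    }


def compare_gzip(src_path, dst_path):
    """Compare two local .gz files by decompressed content and by compressed size."""
    src = describe_gzip(src_path)
    dst = describe_gzip(dst_path)
    return {
        "same_content": src["sha256"] == dst["sha256"],
        "same_compressed_size": src["compressed_bytes"] == dst["compressed_bytes"],
        "source": src,
        "destination": dst,
    }
```

Calling compare_gzip("source-sample.gz", "dest-sample.gz") on a recompressed copy would be expected to report same_content as True and same_compressed_size as False, which matches what gzip -l shows for my files.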

nikos64
asked a year ago, 51 views
No Answers
