Hi,
I am trying to copy a large number of .gz files between two S3 buckets using the s3-dist-cp command, submitted as a step on an EMR cluster. The copy succeeds, but I noticed a size discrepancy at the destination: the copied files are slightly smaller than the originals.
I downloaded a small sample to my laptop to compare with the source. When uncompressed, the files are identical. gzip -l reports a compression ratio of 65.8% for the source file and 66.2% for the destination file. So it seems that s3-dist-cp modifies the compression.
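A comparison like the one described above can also be scripted. The following is a minimal Python sketch, not the exact procedure used for the numbers above. It checks whether two gzip files are byte-identical, whether their uncompressed payloads match, and which header fields differ. The function names gzip_summary and compare_gzip are hypothetical, and the demo data is generated locally (compressed at two different levels) to stand in for a source file and a recompressed copy.

```python
import gzip
import hashlib
import struct


def gzip_summary(raw: bytes) -> dict:
    """Summarise a gzip file: fixed header fields plus a hash of the payload."""
    # Fixed 10-byte gzip header (RFC 1952): ID1 ID2 CM FLG MTIME XFL OS
    id1, id2, method, flags, mtime, xfl, os_code = struct.unpack("<BBBBIBB", raw[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    payload = gzip.decompress(raw)
    return {
        "compressed_size": len(raw),
        "uncompressed_size": len(payload),
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "flags": flags,
        "mtime": mtime,
        "xfl": xfl,  # hint about the compression level used by the writer
        "os": os_code,
    }


def compare_gzip(src: bytes, dst: bytes) -> dict:
    """Compare two gzip files at the byte level and at the payload level."""
    a, b = gzip_summary(src), gzip_summary(dst)
    header_fields = ("flags", "mtime", "xfl", "os")
    return {
        "same_bytes": src == dst,
        "same_payload": a["payload_sha256"] == b["payload_sha256"],
        "sizes": (a["compressed_size"], b["compressed_size"]),
        "header_diff": {k: (a[k], b[k]) for k in header_fields if a[k] != b[k]},
    }


if __name__ == "__main__":
    # Stand-in data: the same payload compressed with two different settings.
    data = b"\n".join(b"row %d value %d" % (i, (i * i) % 97) for i in range(5000))
    source_like = gzip.compress(data, compresslevel=1)
    copy_like = gzip.compress(data, compresslevel=9)
    print(compare_gzip(source_like, copy_like))
```

To run it against real files, read each file with open(path, "rb").read() and pass the two byte strings to compare_gzip. A result with same_payload true and same_bytes false would be consistent with the files having been recompressed rather than copied byte for byte.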
Is it possible to just copy the files as-is? In this article it is mentioned that --outputCodec=keep is supposed to do just that, but in the docs there is no keep option for the --outputCodec argument. Instead, the docs state:
"If you do not specify a value for --outputCodec, the files are copied over with no change in their compression."
This does not appear to be true. I tried all of the invocations below, and none of them copied the files as-is:
s3-dist-cp --src=s3a://source-bucket/ --dest=s3a://dest-bucket/ --copyFromManifest --previousManifest=s3://dest-bucket/manifest.gz
s3-dist-cp --src=s3a://source-bucket/ --dest=s3a://dest-bucket/ --copyFromManifest --previousManifest=s3://dest-bucket/manifest.gz --outputCodec=keep
s3-dist-cp --src=s3a://source-bucket/ --dest=s3a://dest-bucket/ --copyFromManifest --previousManifest=s3://dest-bucket/manifest.gz --outputCodec=none (this actually uncompressed the files at the destination)
In contrast, when copying .gz.parquet files there is no size discrepancy. So, how can I achieve the same result for .gz files?