As discussed in the comments, the solution in the blog post doesn't solve your problem. The post deals with a different problem, one for which there is no solution, whereas there is one for yours. You are not trying to compare two objects already in S3 for identical content without downloading either of them. Instead, you are processing an upload and have direct access to the cleartext content of the object the user is uploading.
Therefore, you can calculate the MD5 hash even for a large object according to the same rules that S3 applies. The number of parts used for uploading the earlier, potentially duplicate object to S3 is indicated by the part-count suffix in its ETag, and the part size is usually a simple binary multiple of megabytes, such as 16 MB or 64 MB. Assuming the earlier objects were uploaded by your application, or by any other method that systematically used one or two multipart chunk sizes, there should be no surprises.
You'd simply check the multipart part counter of the existing object from its ETag, determine the chunk size based on the size of the object and the number of chunks, and calculate the MD5 hash of the object you're about to upload according to the same rules that S3 uses.
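As a minimal sketch of that calculation in Python: the function and variable names are my own, and the list of candidate part sizes is an assumption based on common SDK and tool defaults, not anything S3 prescribes, so extend it to match whatever chunk sizes your uploads actually use.

```python
import hashlib
import math
import os

# Candidate part sizes to try, in bytes. These are assumptions based on
# common defaults (boto3's transfer manager uses 8 MiB, other tools often
# use 16 MiB or more); adjust to match your own upload history.
CANDIDATE_PART_SIZES = [s * 1024 * 1024 for s in (5, 8, 15, 16, 32, 64, 128)]


def multipart_etag(path, part_size):
    """Compute an S3-style multipart ETag for a local file: the MD5 of the
    concatenated per-part MD5 digests, suffixed with "-<part count>"."""
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


def is_probable_duplicate(path, existing_etag, existing_size):
    """Compare a local file against an existing object's ETag and size,
    trying only part sizes that reproduce the part count in the ETag."""
    if os.path.getsize(path) != existing_size:
        return False
    existing_etag = existing_etag.strip('"')  # boto3 returns the ETag quoted
    if "-" not in existing_etag:
        # Single-part upload: the ETag is a plain MD5 of the content.
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest() == existing_etag
    part_count = int(existing_etag.rsplit("-", 1)[1])
    for part_size in CANDIDATE_PART_SIZES:
        if math.ceil(existing_size / part_size) == part_count:
            if multipart_etag(path, part_size) == existing_etag:
                return True
    return False
```

If none of the candidate part sizes reproduce the existing ETag, you simply treat the upload as a non-duplicate, which is the safe default.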
The only remaining caveat is that this only works for SSE-S3-encrypted objects, because for SSE-KMS, the customer can't access the ciphertext version of the objects directly, or the bucket keys, to obtain the data key with which they are encrypted. For that reason, the MD5 hash of the ciphertext, which is probably what the ETag contains for SSE-KMS-encrypted objects, is not useful for customers.
Hi, a full solution to your problem was recently published as a blog post: https://aws.amazon.com/blogs/storage/managing-duplicate-objects-in-amazon-s3/
In this post, I discuss how you can initiate your own data deduplication process for objects stored within an S3 bucket. We identify these duplicate objects in your bucket using Amazon Athena, validate the duplicates can be removed, and delete them using AWS Lambda and S3 Batch Operations. This will help you reduce storage costs for objects with duplicate content without having to manually pick out objects to be deleted.
Best,
Didier
If you are using SSE-S3 encryption (and not SSE-KMS), the logic for calculating the MD5 hash exposed as the ETag value for an object uploaded as a multipart upload is documented here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums
If the object is encrypted with SSE-KMS, the documentation only says that the ETag value won't be an MD5 hash of the content, without elaborating. I suspect it is still an MD5 hash calculated the same way, but of the ciphertext after the object has been encrypted, and therefore not an MD5 of anything the customer could compute themselves, since we don't have access to the data keys used to encrypt the objects when using SSE-KMS.
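In practice, that means it's worth checking how the existing object is encrypted, and whether its ETag carries a part-count suffix, before attempting any comparison. A hedged sketch using boto3 (the function name and the returned dictionary shape are my own, not an AWS API):

```python
import boto3

s3 = boto3.client("s3")


def describe_existing_object(bucket, key):
    """Fetch the metadata needed to decide whether an ETag comparison is
    meaningful. Returns None when the object is SSE-KMS encrypted, because
    its ETag is then not an MD5 of anything we can compute locally."""
    head = s3.head_object(Bucket=bucket, Key=key)
    sse = head.get("ServerSideEncryption")  # "AES256" for SSE-S3, "aws:kms" for SSE-KMS
    if sse and sse.startswith("aws:kms"):
        return None
    etag = head["ETag"].strip('"')
    part_count = int(etag.rsplit("-", 1)[1]) if "-" in etag else 1
    return {"etag": etag, "part_count": part_count, "size": head["ContentLength"]}
```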
@didier: Does this solution work with multipart-uploaded objects? It relies on the inventory and the ETag, which is the MD5 hash of the object only if it was NOT uploaded as a multipart upload. For this solution to work with multipart uploads, I think additional checksums need to be turned on - https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/ - and you need to compare that checksum from the inventory. This also lets you use SHA* checksums.
@Max Celements You're right, the solution in the blog post does not work for multipart-uploaded objects. It mentions it explicitly: "objects that are uploaded through Multipart Upload or Part Copy operation do not have an ETag that can be used for data deduplication, and thus these objects are outside the scope of this post."
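Following up on the additional-checksums point above, a minimal sketch of requesting a SHA-256 checksum at upload time with boto3 (the bucket and key names are placeholders); the same ChecksumAlgorithm parameter is also accepted by create_multipart_upload for large objects, in which case S3 stores a checksum of the part checksums that S3 Inventory can report:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key names. ChecksumAlgorithm asks S3 to compute
# and store a SHA-256 checksum alongside the object, which can then be
# surfaced in S3 Inventory for deduplication comparisons.
with open("report.bin", "rb") as body:
    s3.put_object(
        Bucket="amzn-s3-demo-bucket",
        Key="reports/report.bin",
        Body=body,
        ChecksumAlgorithm="SHA256",
    )
```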