Issue using SHA256 checksum to check for duplicate file

0

When a user uploads a file, I need to check whether the file name already exists and, if so, whether the file content is identical to the existing object. I'd like to use SHA256 for the comparison. I can get the checksum from S3 using GetObjectAttributes. This works fine if the file was uploaded as a single object, but if the file was uploaded using multipart upload, the ChecksumSHA256 returned by GetObjectAttributes does not match the SHA256 of the whole file.

Is there another way I could do this check without downloading the file from S3 first? I don't want to waste the time and cost of downloading the file if I can avoid it.

asked a year ago · 1.1K views
3 Answers
1
Accepted Answer

As discussed in the comments, the solution in the blog post doesn't solve your problem. However, the post is dealing with a different problem for which there is no solution, while there is one for yours. You are not trying to compare objects already in S3 for identical content without downloading either one. Instead, you are processing an upload, with direct access to the cleartext content of the object the user is uploading.

Therefore, you can calculate the MD5 hash even for a large object according to the same rules that S3 applies. The number of parts used for uploading the earlier, potentially duplicate object to S3 is indicated by the part-count suffix of its ETag (e.g. `-16`), and usually the part size is a simple binary multiple of megabytes, like 16 MiB or 64 MiB. Assuming the earlier objects were uploaded by your application, or by any other method that consistently used one or two multipart chunk sizes, no surprises should be expected.
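As a sketch of that inference step (the list of "common" part sizes here is an assumption about typical upload tools, not something S3 reports), you can narrow down which chunk sizes are consistent with the object's size and part count:

```python
def candidate_part_sizes(object_size: int, part_count: int) -> list[int]:
    """Return common part sizes (in bytes) that are consistent with the
    object's total size and the part count taken from its ETag suffix."""
    mib = 1024 * 1024
    common_sizes_mib = (5, 8, 16, 32, 64, 128)  # assumed typical defaults
    candidates = []
    for size_mib in common_sizes_mib:
        part_size = size_mib * mib
        # ceil division: how many parts an upload with this size would use
        if -(-object_size // part_size) == part_count:
            candidates.append(part_size)
    return candidates
```

If more than one candidate remains, you would have to compute the hash for each candidate size and compare against the ETag; with a consistent upload method there is usually a single match.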

You'd simply read the part count of the existing object from its ETag suffix, determine the chunk size from the size of the object and the number of parts, and calculate the MD5 hash of the object you're about to upload according to the same rules that S3 uses.
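A minimal sketch of those two steps, assuming the documented ETag scheme (plain MD5 for a single PUT; MD5 of the concatenated part MD5s plus a `-<count>` suffix for multipart uploads):

```python
import hashlib


def part_count_from_etag(etag: str) -> int:
    """Multipart ETags look like 'd41d8cd9...-16'; single-PUT ETags
    have no suffix."""
    etag = etag.strip('"')
    return int(etag.rsplit("-", 1)[1]) if "-" in etag else 1


def s3_style_etag(data: bytes, part_size: int, multipart: bool) -> str:
    """Compute the ETag S3 would report: a plain MD5 hex digest for a
    single PUT, or MD5-of-part-MD5s plus '-<part count>' for multipart."""
    if not multipart:
        return hashlib.md5(data).hexdigest()
    digests = [hashlib.md5(data[i:i + part_size]).digest()
               for i in range(0, len(data), part_size)]
    return hashlib.md5(b"".join(digests)).hexdigest() + f"-{len(digests)}"
```

For a large incoming upload you would feed each chunk into its own `hashlib.md5()` incrementally rather than holding the whole file in memory; the byte-slicing here just keeps the sketch short.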

The only remaining caveat is that this works only for SSE-S3-encrypted objects. With SSE-KMS, the customer has no direct access to the ciphertext of the objects, nor to the data keys with which they are encrypted. For that reason, the MD5 hash of the ciphertext, which is probably what the ETag contains for SSE-KMS-encrypted objects, is not something the customer can reproduce.

EXPERT
answered a year ago
AWS
EXPERT
reviewed a year ago
0

Hi, a full solution to your problem was recently published as a blog post: https://aws.amazon.com/blogs/storage/managing-duplicate-objects-in-amazon-s3/

In this post, I discuss how you can initiate your own data deduplication process for objects stored within an S3 bucket. We identify these duplicate objects in your bucket using Amazon Athena, validate that the duplicates can be removed, and delete them using AWS Lambda and S3 Batch Operations. This helps you reduce storage costs for objects with duplicate content without having to manually pick out objects to be deleted.

Best,

Didier

AWS
EXPERT
answered a year ago
EXPERT
reviewed a year ago
  • @didier: Does this solution work with multipart-uploaded objects? It relies on the inventory and the ETag, which is the MD5 hash of the object only if it was NOT uploaded as a multipart upload. For this solution to work with multipart uploads, I think additional checksums need to be turned on - https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/ - and you need to compare that checksum from the inventory. This also lets you use SHA-* checksums.

  • @Max Celements You're right, the solution in the blog post does not work for multipart-uploaded objects. It mentions it explicitly: "objects that are uploaded through Multipart Upload or Part Copy operation do not have an ETag that can be used for data deduplication, and thus these objects are outside the scope of this post."
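Building on the comments above: with additional checksums enabled, S3 documents the composite ChecksumSHA256 for a multipart object as the SHA-256 of the concatenated part-level SHA-256 digests, base64-encoded. A sketch of that calculation (the `-<part count>` suffix is appended here because some APIs report the composite checksum with it; strip it before comparing if your API returns the bare value):

```python
import base64
import hashlib


def composite_sha256(data: bytes, part_size: int) -> str:
    """SHA-256 checksum-of-checksums as S3 computes it for multipart
    uploads: hash each part, hash the concatenated digests, base64-encode,
    and append '-<part count>'."""
    digests = [hashlib.sha256(data[i:i + part_size]).digest()
               for i in range(0, len(data), part_size)]
    combined = hashlib.sha256(b"".join(digests)).digest()
    return base64.b64encode(combined).decode() + f"-{len(digests)}"
```

To reuse this against an existing object, the part boundaries must match the original upload's, which is why knowing (or inferring) the original chunk size matters just as much here as it does for the ETag approach.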

0

If you are using SSE-S3 encryption (and not SSE-KMS), the logic for calculating the MD5 hash exposed as the ETag value for an object uploaded as a multipart upload is documented here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums

If the object is encrypted with SSE-KMS, the documentation only says that the ETag value won't be an MD5 hash of the content, without elaborating. I suspect it is still an MD5 hash calculated the same way, but of the ciphertext after the object has been encrypted, and therefore not an MD5 hash of anything the customer could compute, since we don't have access to the data keys used to encrypt the objects under SSE-KMS.

EXPERT
answered a year ago
EXPERT
reviewed a year ago
