HELP with uploading large files to Glacier



I don't know why glacier is so convoluted/tricky to manage. Is this part of the reason why it's so cheap?

I'm merely trying to upload some large VMDKs to Glacier.

I've been trawling AWS documentation for days and I literally have 50 tabs open detailing exactly nothing. All the progress I've made thus far (just creating a freaking vault) was done with the help of information from obscure blogs and YouTube videos.

For example, there is a heading "Preparing a File", but this is immediately followed by "Create a file". So which is it: preparing an EXISTING file or creating a NEW file? Is this step even required? Why is this so convoluted?

Then, from the looks of it, I need to use some ancient buggy Windows XP program to split the files into chunks before uploading? Are you kidding me?! It already took the best part of a day to export this large vmdk. Now I have to spend another day merely "splitting" it into chunks (if I have enough HDD space, that is), and then I have to make sure I don't make any mistakes in the CLI commands that follow by correctly stating the starting/ending bytes FOR EACH CHUNK. Then another day for uploading, and another day to reassemble it? Again, are you kidding me?! If I have a 100GB file, how many chunks will this result in? I have to address EACH chunk with its own special little line of code. Absolutely bonkers.

I'm on CBT Nuggets and TrainSignal, neither of these have any support videos on Glacier, does anyone know of any other material that will help me grasp what exactly I need to do in order to upload large files to Glacier?

I know there are 3rd party clients available but I'd like to understand how to do this via cmd.

Thanks for reading.

Edited by: fnanfne on Sep 17, 2019 4:19 AM

Edited by: fnanfne on Sep 17, 2019 8:53 AM

asked 4 years ago · 226 views
3 Answers
Accepted Answer

The part about:

"Preparing a File" but this is immediately followed by "Create a file"

was just meant to give the reader a "fake file" to work with, so they could try following the steps of uploading to a Glacier vault using the CLI without needing a real archive.

After reviewing what needs to be done by hand, I would definitely go with a third-party tool rather than ever attempting this manually. But if you are a glutton for punishment, here are the steps you need to perform (note: I just wanted to see if I could help make sense of the docs; I did NOT actually try these steps, but hopefully this makes them clearer):

First, use the AWS CLI to create the vault:

$ aws glacier create-vault --account-id 7356xxxxxxxx --vault-name myvault
    "location": "/7356xxxxxxxx/vaults/myvault"

Next, split the 100 GiB file into 100 MiB chunks (on Windows, HJ-Split can do this).
Note: a 100 GiB file will produce 1,024 chunk files (100 GiB = 102,400 MiB), and yes, you need roughly double the disk space while the chunks exist.
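If you have access to a Unix-like shell (Linux, macOS, WSL, or Git Bash on Windows), the standard `split` utility can do the chunking instead of HJ-Split. A small sketch on a dummy file (the real input would be your vmdk, split with `-b 100m` to match the part size used below):

```shell
# Make a 5 MiB dummy file to stand in for the real vmdk
dd if=/dev/zero of=bigfile.bin bs=1M count=5 2>/dev/null

# Split into 2 MiB pieces named chunkaa, chunkab, chunkac
# (for the real file, use -b 100m to match the 100 MiB part size)
split -b 2m bigfile.bin chunk

ls chunk*   # chunkaa chunkab chunkac
```

The `aa`, `ab`, ... suffixes are split's defaults and match the chunk names used in the upload commands below.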

This command initiates the multipart upload on the vault, with the size of each part specified:

aws glacier initiate-multipart-upload --account-id 7356xxxxxxxx --archive-description "multipart upload test" --part-size 104857600 --vault-name myvault

The above command returns the UPLOADID:

    "uploadId": "19gaRezEXAMPLES6Ry5YYdqthHOC_kGRCT03L9yetr220UmPtBYKk-OssZtLqyFu7sY1_lR7vgFuJV6NtcV5zpsJ",
    "location": "/123456789012/vaults/myvault/multipart-uploads/19gaRezEXAMPLES6Ry5YYdqthHOC_kGRCT03L9yetr220UmPtBYKk-OssZtLqyFu7sY1_lR7vgFuJV6NtcV5zpsJ"

In Windows, set UPLOADID as an environment variable:

set UPLOADID=19gaRezEXAMPLES6Ry5YYdqthHOC_kGRCT03L9yetr220UmPtBYKk-OssZtLqyFu7sY1_lR7vgFuJV6NtcV5zpsJ

Then you need to create and run one of these commands for every chunk (1,024 of them for the 100 GiB file). Each one needs the correct byte range for its chunk:

$ aws glacier upload-multipart-part --upload-id %UPLOADID% --body chunkaa --range 'bytes 0-104857599/*' --account-id 7356xxxxxxxx --vault-name myvault
    "checksum": "e1f2a7cd6e047fa606fe2f0280350f69b9f8cfa602097a9a026360a7edc1f553"
$ aws glacier upload-multipart-part --upload-id %UPLOADID% --body chunkab --range 'bytes 104857600-209715199/*' --account-id 7356xxxxxxxx --vault-name myvault
    "checksum": "e1f2a7cd6e047fa606fe2f0280350f69b9f8cfa602097a9a026360a7edc1f553"
$ aws glacier upload-multipart-part --upload-id %UPLOADID% --body chunkac --range 'bytes 209715200-314572799/*' --account-id 7356xxxxxxxx --vault-name myvault
    "checksum": "e1f2a7cd6e047fa606fe2f0280350f69b9f8cfa602097a9a026360a7edc1f553"
...and so on: one of the above for each of the 100 MiB chunks from the split 100 GiB file. Note: a third-party tool would most likely run these in multiple concurrent threads with configurable retries for better performance/reliability.
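Rather than typing 1,000+ of those lines by hand, the byte ranges can be generated with a small loop. A sketch, assuming a Unix-like shell and split's default aa/ab/... chunk naming; it echoes the commands instead of running them so you can inspect them first (the account id is the placeholder from above, and the demo chunk files stand in for the real split output):

```shell
PARTSIZE=104857600   # 100 MiB, must match --part-size above

# Demo chunk files standing in for the real split output
printf 'aaaa' > chunkaa
printf 'bb'   > chunkab

i=0
for f in chunk??; do
  start=$((i * PARTSIZE))
  size=$(wc -c < "$f")              # last chunk may be smaller
  end=$((start + size - 1))
  echo "aws glacier upload-multipart-part --upload-id \$UPLOADID" \
       "--body $f --range 'bytes ${start}-${end}/*'" \
       "--account-id 7356xxxxxxxx --vault-name myvault"
  i=$((i + 1))
done
```

Pipe the output into a file, review it, then run it as a script (or swap the `echo` for the real command).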

Then you will need to create hashes for ALL of the chunk files created when you split the 100 GiB file.
Download OpenSSL for Windows.
Create a file containing the hash of each chunk:

$ openssl dgst -sha256 -binary chunkaa > hash1
$ openssl dgst -sha256 -binary chunkab > hash2
$ openssl dgst -sha256 -binary chunkac > hash3
$ openssl dgst -sha256 -binary chunkxxxx > hash1024
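The per-chunk hashing is equally loopable in a Unix-like shell; a sketch, again assuming split's chunk naming (shown here on small demo files standing in for the real chunks):

```shell
# Demo chunks (the real ones come from splitting the vmdk)
printf 'first part'  > chunkaa
printf 'second part' > chunkab

n=1
for f in chunk??; do
  openssl dgst -sha256 -binary "$f" > "hash$n"   # 32 raw bytes each
  n=$((n + 1))
done
```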

Next, create the TREEHASH that is used to verify the upload when finalizing the multipart upload. The tree hash is built by pairing adjacent hashes, hashing each concatenated pair, and repeating on the resulting level until a single hash remains (a lone hash at the end of a level is carried up to the next level unchanged). Note: the Glacier docs define the tree hash over 1 MiB blocks of the archive, so for a real upload the leaf hashes come from 1 MiB pieces rather than the 100 MiB part files. On Windows the pairing looks like this:

type hash1 hash2 > pair12
openssl dgst -sha256 -binary pair12 > hash12
type hash3 hash4 > pair34
openssl dgst -sha256 -binary pair34 > hash34
type hash12 hash34 > pair1234
openssl dgst -sha256 -binary pair1234 > hash1234
continue pairing level by level until a single hash remains

On the very last pair, run the digest without -binary and without redirecting it to a file:

   openssl dgst -sha256 pairfinal

Something like the following will be returned:

 SHA256(pairfinal)= 9628195fcdbcbbe76cdde932d4646fa7de5f219fb39823836d81f0cc0e18aa67

This is the final TREEHASH used to verify the uploaded archive. Set that value as the TREEHASH environment variable:

set TREEHASH=9628195fcdbcbbe76cdde932d4646fa7de5f219fb39823836d81f0cc0e18aa67
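The tree-hash reduction can also be scripted. One caveat worth knowing: the Glacier docs define the tree hash over 1 MiB blocks of the archive (the last block may be shorter), combined pairwise level by level, with a lone odd hash carried up unchanged. A minimal POSIX-sh sketch of that pairwise reduction, on three demo leaf hashes (file names here are illustrative):

```shell
# Demo leaf hashes (a real run would have one per 1 MiB block)
printf 'block-a' | openssl dgst -sha256 -binary > hash1
printf 'block-b' | openssl dgst -sha256 -binary > hash2
printf 'block-c' | openssl dgst -sha256 -binary > hash3

# Pair adjacent hashes, hash each concatenation, and repeat until one
# hash remains; an odd hash at the end of a level is carried up as-is.
set -- hash1 hash2 hash3
lvl=0
while [ $# -gt 1 ]; do
  lvl=$((lvl + 1)); idx=0; next=""
  while [ $# -gt 0 ]; do
    idx=$((idx + 1)); out="tree_${lvl}_${idx}"
    if [ $# -ge 2 ]; then
      cat "$1" "$2" | openssl dgst -sha256 -binary > "$out"; shift 2
    else
      cp "$1" "$out"; shift 1                     # odd hash promoted
    fi
    next="$next $out"
  done
  set -- $next
done

# Hex-encode the root hash for use as the --checksum value
TREEHASH=$(od -An -tx1 "$1" | tr -d ' \n')
echo "$TREEHASH"
```

With three leaves this computes H(H(h1||h2) || h3), matching the promote-the-odd-hash rule.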

Obviously, doing the above by hand is NOT feasible for a large file.
Complete the multipart upload by providing the archive size and the supporting TREEHASH checksum:

$ aws glacier complete-multipart-upload --checksum %TREEHASH% --archive-size <total size of archive> --upload-id %UPLOADID% --account-id 7356xxxxxxxx --vault-name myvault

Hopefully the above breakdown helps you better understand what is required to upload to a Glacier vault using the AWS CLI. Yes, it is very painful, but I don't think AWS ever expected anyone to do this by hand for a large file.

Hope this helps!

Edited by: RandyTakeshita on Sep 17, 2019 2:00 PM

answered 4 years ago

We recommend using Glacier as an S3 storage class. Large files are best uploaded using multipart uploads; the S3 documentation on multipart uploads covers this in detail.

You can split the large file into smaller parts using whichever workflow best fits your environment.

answered 4 years ago

Thank you RandyTakeshita for taking the time to dissect it for me. This is what I had in mind as well, but you very neatly laid out all the steps for me in a clear and concise manner, which is exactly what I wanted. Thanks again!

answered 4 years ago
