HELP with uploading large files to Glacier



I don't know why glacier is so convoluted/tricky to manage. Is this part of the reason why it's so cheap?

I'm merely trying to upload some large VMDKs to Glacier.

I've been trawling AWS documentation for days and I literally have 50 tabs open detailing exactly nothing. All the progress I've made thus far (just creating a freaking vault) was done with the help of information from obscure blogs and YouTube videos.

For example, there is a heading "Preparing a File", but this is immediately followed by "Create a file". So which is it: preparing an EXISTING file or creating a NEW file? Is this step even required? Why is this so convoluted?

Then, from the looks of it, I need to use some ancient buggy Windows XP program to split the files into chunks before uploading? Are you kidding me?! It already took the best part of a day to export this large vmdk. Now I have to spend another day merely "splitting" it into chunks (if I have enough HDD space, that is), and then I have to make sure I don't make any mistakes in the CLI commands that follow by correctly stating the starting/ending bytes FOR EACH CHUNK. Then another day for uploading, and another day to reassemble it? Again, are you kidding me?! If I have a 100GB file, how many chunks will this result in? I have to address EACH chunk with its own special little line of code. Absolutely bonkers.

I'm on CBT Nuggets and TrainSignal, neither of these have any support videos on Glacier, does anyone know of any other material that will help me grasp what exactly I need to do in order to upload large files to Glacier?

I know there are 3rd party clients available but I'd like to understand how to do this via cmd.

Thanks for reading.

Edited by: fnanfne on Sep 17, 2019 4:19 AM

Edited by: fnanfne on Sep 17, 2019 8:53 AM

asked 4 years ago · 226 views
3 Answers
Accepted Answer

The part about:

"Preparing a File" but this is immediately followed by "Create a file"

was just meant to give the reader a "fake file" to work with, so they could try following the steps of uploading to a Glacier vault using the CLI without needing a real archive.

After reviewing what needs to be done by hand, I would definitely go with a third-party tool rather than ever attempting this manually. But if you are a glutton for punishment, here are the steps you need to perform (note: I just wanted to see if I could help make sense of the docs; I did NOT actually try these steps, but hopefully this makes them clearer):

First, use the AWS CLI to create the vault:

$ aws glacier create-vault --account-id 7356xxxxxxxx --vault-name myvault
    "location": "/7356xxxxxxxx/vaults/myvault"

Next, split the 100 GiB file into 100 MiB chunks (on Windows, HJ-Split can do this).
Note: a 100 GiB file will produce 1,024 chunk files (100 GiB = 102,400 MiB), and yes, you need roughly double the disk space while the chunks exist.
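If you have access to a Unix-like shell (Linux, macOS, WSL, or Git Bash on Windows), the standard `split` utility can do the chunking instead of HJ-Split. A small sketch on a dummy file (the real input would be your vmdk, split with `-b 100m` to match the part size used below):

```shell
# Make a 5 MiB dummy file to stand in for the real vmdk
dd if=/dev/zero of=bigfile.bin bs=1M count=5 2>/dev/null

# Split into 2 MiB pieces named chunkaa, chunkab, chunkac
# (for the real file, use -b 100m to match the 100 MiB part size)
split -b 2m bigfile.bin chunk

ls chunk*   # chunkaa chunkab chunkac
```

The `aa`, `ab`, ... suffixes are split's defaults and match the chunk names used in the upload commands below.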

This command initiates the multipart upload on the vault, with the size of each part specified:

aws glacier initiate-multipart-upload --account-id 7356xxxxxxxx --archive-description "multipart upload test" --part-size 104857600 --vault-name myvault

The above command returns the UPLOADID:

    "uploadId": "19gaRezEXAMPLES6Ry5YYdqthHOC_kGRCT03L9yetr220UmPtBYKk-OssZtLqyFu7sY1_lR7vgFuJV6NtcV5zpsJ",
    "location": "/123456789012/vaults/myvault/multipart-uploads/19gaRezEXAMPLES6Ry5YYdqthHOC_kGRCT03L9yetr220UmPtBYKk-OssZtLqyFu7sY1_lR7vgFuJV6NtcV5zpsJ"

In Windows, set UPLOADID as an environment variable:

set UPLOADID=19gaRezEXAMPLES6Ry5YYdqthHOC_kGRCT03L9yetr220UmPtBYKk-OssZtLqyFu7sY1_lR7vgFuJV6NtcV5zpsJ

Then you need to create and run one of these commands for every chunk (1,024 of them for the 100 GiB file). Each one needs the correct byte range for its chunk:

$ aws glacier upload-multipart-part --upload-id %UPLOADID% --body chunkaa --range 'bytes 0-104857599/*' --account-id 7356xxxxxxxx --vault-name myvault
    "checksum": "e1f2a7cd6e047fa606fe2f0280350f69b9f8cfa602097a9a026360a7edc1f553"
$ aws glacier upload-multipart-part --upload-id %UPLOADID% --body chunkab --range 'bytes 104857600-209715199/*' --account-id 7356xxxxxxxx --vault-name myvault
    "checksum": "e1f2a7cd6e047fa606fe2f0280350f69b9f8cfa602097a9a026360a7edc1f553"
$ aws glacier upload-multipart-part --upload-id %UPLOADID% --body chunkac --range 'bytes 209715200-314572799/*' --account-id 7356xxxxxxxx --vault-name myvault
    "checksum": "e1f2a7cd6e047fa606fe2f0280350f69b9f8cfa602097a9a026360a7edc1f553"
...and so on: one of the above for each of the 100 MiB chunks from the split 100 GiB file. Note: a third-party tool would most likely run these in multiple concurrent threads with configurable retries for better performance/reliability.
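Rather than typing 1,000+ of those lines by hand, the byte ranges can be generated with a small loop. A sketch, assuming a Unix-like shell and split's default aa/ab/... chunk naming; it echoes the commands instead of running them so you can inspect them first (the account id is the placeholder from above, and the demo chunk files stand in for the real split output):

```shell
PARTSIZE=104857600   # 100 MiB, must match --part-size above

# Demo chunk files standing in for the real split output
printf 'aaaa' > chunkaa
printf 'bb'   > chunkab

i=0
for f in chunk??; do
  start=$((i * PARTSIZE))
  size=$(wc -c < "$f")              # last chunk may be smaller
  end=$((start + size - 1))
  echo "aws glacier upload-multipart-part --upload-id \$UPLOADID" \
       "--body $f --range 'bytes ${start}-${end}/*'" \
       "--account-id 7356xxxxxxxx --vault-name myvault"
  i=$((i + 1))
done
```

Pipe the output into a file, review it, then run it as a script (or swap the `echo` for the real command).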

Then you will need to create hashes for ALL of the chunk files created when you split the 100 GiB file.
Download OpenSSL for Windows.
Create a file containing the hash of each chunk:

$ openssl dgst -sha256 -binary chunkaa > hash1
$ openssl dgst -sha256 -binary chunkab > hash2
$ openssl dgst -sha256 -binary chunkac > hash3
$ openssl dgst -sha256 -binary chunkxxxx > hash1024
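The per-chunk hashing is equally loopable in a Unix-like shell; a sketch, again assuming split's chunk naming (shown here on small demo files standing in for the real chunks):

```shell
# Demo chunks (the real ones come from splitting the vmdk)
printf 'first part'  > chunkaa
printf 'second part' > chunkab

n=1
for f in chunk??; do
  openssl dgst -sha256 -binary "$f" > "hash$n"   # 32 raw bytes each
  n=$((n + 1))
done
```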

Next, create the TREEHASH that is used to verify the upload when finalizing the multipart upload. The tree hash is built by pairing adjacent hashes, hashing each concatenated pair, and repeating on the resulting level until a single hash remains (a lone hash at the end of a level is carried up to the next level unchanged). Note: the Glacier docs define the tree hash over 1 MiB blocks of the archive, so for a real upload the leaf hashes come from 1 MiB pieces rather than the 100 MiB part files. On Windows the pairing looks like this:

type hash1 hash2 > pair12
openssl dgst -sha256 -binary pair12 > hash12
type hash3 hash4 > pair34
openssl dgst -sha256 -binary pair34 > hash34
type hash12 hash34 > pair1234
openssl dgst -sha256 -binary pair1234 > hash1234
continue pairing level by level until a single hash remains

On the very last pair, run the digest without -binary and without redirecting it to a file:

   openssl dgst -sha256 pairfinal

Something like the following will be returned:

 SHA256(pairfinal)= 9628195fcdbcbbe76cdde932d4646fa7de5f219fb39823836d81f0cc0e18aa67

This is the final TREEHASH used to verify the uploaded archive. Set that value as the TREEHASH environment variable:

set TREEHASH=9628195fcdbcbbe76cdde932d4646fa7de5f219fb39823836d81f0cc0e18aa67
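The tree-hash reduction can also be scripted. One caveat worth knowing: the Glacier docs define the tree hash over 1 MiB blocks of the archive (the last block may be shorter), combined pairwise level by level, with a lone odd hash carried up unchanged. A minimal POSIX-sh sketch of that pairwise reduction, on three demo leaf hashes (file names here are illustrative):

```shell
# Demo leaf hashes (a real run would have one per 1 MiB block)
printf 'block-a' | openssl dgst -sha256 -binary > hash1
printf 'block-b' | openssl dgst -sha256 -binary > hash2
printf 'block-c' | openssl dgst -sha256 -binary > hash3

# Pair adjacent hashes, hash each concatenation, and repeat until one
# hash remains; an odd hash at the end of a level is carried up as-is.
set -- hash1 hash2 hash3
lvl=0
while [ $# -gt 1 ]; do
  lvl=$((lvl + 1)); idx=0; next=""
  while [ $# -gt 0 ]; do
    idx=$((idx + 1)); out="tree_${lvl}_${idx}"
    if [ $# -ge 2 ]; then
      cat "$1" "$2" | openssl dgst -sha256 -binary > "$out"; shift 2
    else
      cp "$1" "$out"; shift 1                     # odd hash promoted
    fi
    next="$next $out"
  done
  set -- $next
done

# Hex-encode the root hash for use as the --checksum value
TREEHASH=$(od -An -tx1 "$1" | tr -d ' \n')
echo "$TREEHASH"
```

With three leaves this computes H(H(h1||h2) || h3), matching the promote-the-odd-hash rule.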

Obviously, doing the above by hand is NOT feasible for a large file.
Complete the multipart upload by providing the archive size and the supporting TREEHASH checksum:

$ aws glacier complete-multipart-upload --checksum %TREEHASH% --archive-size <total size of archive> --upload-id %UPLOADID% --account-id 7356xxxxxxxx --vault-name myvault

Hopefully the above breakdown helps you better understand what is required to upload to a Glacier vault using the AWS CLI. Yes, it is very painful, but I don't think AWS ever expected anyone to do this by hand for a large file.

Hope this helps!

Edited by: RandyTakeshita on Sep 17, 2019 2:00 PM

answered 4 years ago

We recommend using Glacier as an S3 storage class. Large files are best uploaded using multipart uploads; the S3 documentation on multipart uploads covers this in detail.

You can split the large file into smaller parts using whichever workflow best fits your environment.

answered 4 years ago

Thank you RandyTakeshita for taking the time to dissect it for me. This is what I had in mind as well, but you very neatly laid out all the steps for me in a clear and concise manner, which is exactly what I wanted. Thanks again!

answered 4 years ago
