Create S3 Inventory for specific folder and subfolders


I want to create an S3 inventory report for a specific folder and subfolders in my S3 bucket. I have the instructions on how to create the report but want to limit the results to only specific folders. For example, here's what I have in the AWS Admin console.

Parent folder: Bucket1/Server1/I$/J CASES
Subfolders under J CASES: Jackson, Johnson, Jones, etc.

All I'm trying to do is to get the size of the subfolders, not each and every file in the subfolder as there could be hundreds or thousands of files. Something like this:

Folder Name   Size
Jackson       10.5 GB
Johnson       21.6 GB
Jones         35.7 GB

I thought about using the prefix "J CASES", but I'm not sure if that's only going to list the parent folder size or if it's going to include all the subfolders as well.

asked 6 months ago · 470 views
3 Answers
Accepted Answer

To generate an inventory report for just the J CASES folder and its subfolders, you'll want to set the Prefix filter to 'J CASES/'. This will limit the inventory to only objects with key names starting with that prefix.

For the Output schema, choose the CSV format. This allows us to easily work with the results in a spreadsheet.

The important settings:

- Prefix: J CASES/
- Output format: CSV
- Enable size metrics

This will generate a CSV file with one row per object, including the object key name and size in bytes.
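If you'd rather configure this programmatically than in the console, here is a minimal boto3 sketch of the same settings; the inventory ID and destination bucket ARN are assumptions, and the prefix follows the path in the question:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative names: "Bucket1" and the prefix follow the question;
# the inventory ID and destination ARN are made up.
s3.put_bucket_inventory_configuration(
    Bucket="Bucket1",
    Id="j-cases-inventory",
    InventoryConfiguration={
        "Id": "j-cases-inventory",
        "IsEnabled": True,
        # Limit the report to the J CASES folder tree.
        "Filter": {"Prefix": "Server1/I$/J CASES/"},
        "IncludedObjectVersions": "Current",  # or "All"; see the next answer
        # "Size" must be requested explicitly as an optional field.
        "OptionalFields": ["Size"],
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::my-inventory-reports",  # assumed
                "Format": "CSV",
            }
        },
    },
)
```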

To get the folder sizes you want, you can post-process this CSV:

  1. Sort by the Key field so subfolders are grouped together.
  2. Parse out the folder names from the Key.
  3. Sum the Size column for rows with the same folder name.

This will give you the total size for each subfolder under J CASES without having to enumerate every single object.
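If Excel isn't a requirement, a short Python sketch can do all three steps. It assumes the report file is named inventory.csv and that the columns are bucket name, key, and size (inventory CSVs carry no header row, and key names are URL-encoded):

```python
import csv
import urllib.parse
from collections import defaultdict

# Assumed layout: bucket name, key, size (an inventory configured with
# only the "Size" optional field). Keys in CSV inventories are
# URL-encoded, e.g. "%20" for spaces, hence the unquote() call.
PREFIX = "Server1/I$/J CASES/"
totals = defaultdict(int)

with open("inventory.csv", newline="") as f:
    for bucket, key, size in csv.reader(f):
        key = urllib.parse.unquote(key)
        if not key.startswith(PREFIX):
            continue
        # The first path segment after the prefix is the subfolder name.
        subfolder = key[len(PREFIX):].split("/", 1)[0]
        totals[subfolder] += int(size)

for name, total in sorted(totals.items()):
    print(f"{name}\t{total / 1024 ** 3:.1f} GB")
```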

AWS
answered 6 months ago
reviewed a month ago by an EXPERT

If your objective is just to get the aggregated size of prefixes, not the underlying object/key names, then the Storage Lens advanced-metrics dashboard with prefix aggregation is also a neat option; there is no need to do any filtering of CSV/Parquet data. The Storage Lens advanced-metrics dashboard has a slightly higher cost compared to S3 Inventory, and it can only be created at the bucket level, not at the prefix level, but that won't hurt if the number of in-scope objects in your bucket is only a few million. At $0.20 per million objects monitored per month, if you disable the dashboard after 2 days, once the data is ingested, no further cost will be incurred. On this dashboard you can easily see the aggregate size of all prefixes under Bucket1/Server1/I$/J CASES.

However, I would prefer the S3 Inventory route: who knows, after seeing the size of the sub-prefixes you may want to dive deeper and check which objects are taking up that space, sort objects by size, etc. For such unforeseen asks, the same inventory report would give you all the answers at no additional cost. Please also note that an S3 Inventory will span multiple files/reports if the number of objects is in the millions; e.g., 10M objects will produce at least 4 reports for a single day. If that's the case, an Athena query is better than manually filtering each .csv file one after another.
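For reference, the roll-up could look something like this with Athena via boto3. This is only a sketch: the s3_inventory table, default database, and results bucket are assumed names, and the table would first have to be created over the inventory destination; url_decode handles the URL-encoded keys in CSV inventories.

```python
import boto3

athena = boto3.client("athena")

# "s3_inventory" is an assumed table name created over the inventory
# report location; the key path follows the question.
query = """
SELECT regexp_extract(url_decode(key), '^Server1/I\\$/J CASES/([^/]+)/', 1) AS subfolder,
       round(sum(size) / pow(1024, 3), 1) AS size_gb
FROM s3_inventory
WHERE url_decode(key) LIKE 'Server1/I$/J CASES/%'
GROUP BY 1
ORDER BY 2 DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)
```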

Important - Please ensure the inventory scope is "Include all versions"; otherwise, by default, only current versions are in scope. Also ensure the additional metadata field "Size" is selected at the time of S3 inventory creation; otherwise, an S3 inventory with default settings/metadata includes only two fields: bucket_name and key_name.

Also, to contain the cost, I would strongly recommend creating a weekly S3 inventory instead of a daily one. It will deliver the first report on the same timeline (within 48 hrs) but gives you extra days to delete the S3 inventory config before the next schedule/Sunday. The cost of S3 inventory is based on the number of objects per report (not per month), so deleting the config quickly after gathering the first report is the cost-efficient approach.

razguru
answered 6 months ago

Thank you both for these answers. They were extremely helpful. I ended up using the Inventory option with the prefix filter, but Storage Lens sounds pretty cool. I'll have to check that out.

One thing to note is that using the prefix 'J CASES/' by itself didn't return any results. I actually had to use the full path like this: CBB_SERVER1/I$/J CASES/

The hard part was parsing everything out using Excel. I know this is not an Excel Q&A, but for the sake of completeness and to help out anyone else who may be looking at this, here are the additional steps I took. I'm sure there are other ways to do this; it's just what I ended up doing.

  1. Remove extra characters like “CBB_SERVER1”, “K%20”, “%2C%20”, etc. I used the SUBSTITUTE function like this: =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(B2, "CBB_SERVER1/I%24/K%20CASES/",""), "%2C%20", " "), "%20", " ")

  2. Now the folder names are still really long, like this:

     JACKSON MICHAEL -- E DISCOVERY/20CR12345/20.05.21 Ecom 20 items
     JOHNSON TYLER -- EDISCOVERY/20CN56789/15ITEMS.ZIP%24/20200527174715
     JONES SAM -- 21CR98765/15 ITEMS.zip%24/20210323035222

I had to remove everything after the person’s name. This can be difficult because of the variable-length names, and the names of the files are all different. I did some research, and the easiest way I found was to use an Excel feature called FLASH FILL.

https://www.ablebits.com/office-addins-blog/remove-characters-from-string-excel/
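For anyone scripting this instead, extracting the name is a one-line split. The only assumption, based on the three sample keys above, is that the name always ends at the first " -- ":

```python
# Assumption based only on the three sample keys above: the person's
# name is everything before the first " -- " separator.
samples = [
    "JACKSON MICHAEL -- E DISCOVERY/20CR12345/20.05.21 Ecom 20 items",
    "JOHNSON TYLER -- EDISCOVERY/20CN56789/15ITEMS.ZIP%24/20200527174715",
    "JONES SAM -- 21CR98765/15 ITEMS.zip%24/20210323035222",
]

for s in samples:
    print(s.split(" -- ", 1)[0])
# JACKSON MICHAEL
# JOHNSON TYLER
# JONES SAM
```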

  3. Next I had to consolidate the rows with duplicate names and add them all up to get the total size for the parent folder. I used the CONSOLIDATE feature in Excel under the Data tab. We're basically rolling up all of the subfolders into the parent folder.

https://trumpexcel.com/combine-duplicate-rows-and-sum-values-excel/

  4. This wasn't too bad so far, but my next challenge was to look for duplicates between Open Cases and Closed Cases, which proved to be a lot more difficult.
  5. First, I used the formula =G2&" "&H2, which combines the data in columns G and H with a space in between to make it easier to read. In this example, column G is the person's name and column H is the data size. You end up with one column that includes both Name and Size.

     JACKSON MICHAEL 6212016318
     JOHNSON TYLER 510146178
     JONES SAM 10198464966

  6. Repeat the same steps for the Closed Cases.
  7. Use Conditional Formatting, Highlight Cell Rules, Duplicate Values. This step finds and highlights duplicate data anywhere in the two columns (Open and Closed).

https://trumpexcel.com/find-and-remove-duplicates-in-excel
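For completeness, the same duplicate check is a set intersection in Python; the combined "NAME SIZE" strings below are hypothetical examples following step 5:

```python
# Hypothetical "NAME SIZE" strings in the combined-column format from
# step 5; a duplicate is an entry that appears in both lists.
open_cases = {"JACKSON MICHAEL 6212016318", "JOHNSON TYLER 510146178"}
closed_cases = {"JOHNSON TYLER 510146178", "JONES SAM 10198464966"}

for dup in sorted(open_cases & closed_cases):
    print("duplicate:", dup)  # -> duplicate: JOHNSON TYLER 510146178
```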

answered 6 months ago
