How to list S3 directories (not objects) by date added?


You know how, when you're looking through files and folders on a Mac, you can see and sort by "Date Added", and folders are included in that? I know one approach is just to prefix directory names with a date, so that the date-sorting happens alphabetically.

But what about in situations where you don't have control over the naming of directories, like with automated exports from Amazon HealthLake, where you just want to quickly find the most recent export, without clicking into the subdirectories to find a dated object?

I'm guessing someone must have some command line hacks, so I wanted to ask before coming up with a nice for-loop myself.

2 Answers
Accepted Answer

I agree with all that DALDEI stated in the other answer, and suggest you research further before using the technique outlined below, but for this particular use case it may provide the result you're looking for (one command to return the last modified folder):

aws s3api list-objects-v2 --bucket "some-bucket-name" --query 'sort_by(Contents, &LastModified)[-1].Key' | tr -d '"' | grep -oE '[^\"].*\/\s*'

This command first gets the last modified object via an s3api API call. The result can be a file or a "directory". The result is then piped to tr to strip the surrounding quotes, and then to grep to return the characters up to and including the last forward slash. So regardless of whether the last modified object is a "directory" or a file, it returns the entire path to the "directory", including the trailing forward slash, minus the filename if it's a file.
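
If your exports all land under a known prefix, you can also scope the same call so it only scans that part of the bucket. A minimal variant (the bucket and prefix names here are hypothetical; --output text avoids the quote-stripping step):

aws s3api list-objects-v2 \
  --bucket "some-bucket-name" \
  --prefix "healthlake-exports/" \
  --query 'sort_by(Contents, &LastModified)[-1].Key' \
  --output text | grep -oE '.*/'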

Hope this helps!

Notes:

  • I am using "directory" in quotes, consistent with the explanation given by DALDEI
  • I don't know how well this would perform with a large bucket; I only tested with a few nested folders/files, so YMMV
vinceis
answered 2 years ago

There is a fundamental problem with this question -- the concept of "S3 Directories" does not actually exist. Therefore it is not really possible to implement exactly what was asked -- there simply is 'no such thing'.

However there are methods to approximate it.

There are 2 methods typically used to present the illusion of 'directories'.

Native support (always works): S3 natively supports the concept of a 'delimiter' -- but only in the ListObjects API call. The 'delimiter' is not a global property -- it's an option to the API call and can differ between calls. When you specify a delimiter, the API returns two lists: a "common prefix list" and a "key list". The "common prefix list" can be used to model "directories" -- it is the list of unique prefixes that extend the given start prefix by one more delimiter. E.g. if you list "month/day/" with '/' as the delimiter, you may get "month/day/01/", "month/day/02/", "month/day/03/" in the "CommonPrefixes" list.
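
A single call like the following returns just the immediate "subdirectories" under a prefix, without listing every object beneath them (bucket and prefix names are hypothetical):

aws s3api list-objects-v2 \
  --bucket "some-bucket-name" \
  --prefix "month/day/" \
  --delimiter "/" \
  --query "CommonPrefixes[].Prefix" \
  --output text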

That allows you to do an incremental, hierarchical 'tree explorer' view -- to 'drill down', call ListObjects with a prefix of one of the listed common prefixes, and the result will be one level deeper. However, this doesn't solve the stated problem -- and it cannot, as these are not really directories. They do not exist, so they have no metadata; they are key prefixes associated with actual S3 objects (keys). In order to get 'last added' you need to query for all the S3 objects 'under' a given prefix and take the most recent LastModified date. If you need to do this recursively (say, for a deeply nested hierarchy), this may take many API calls to collect all the child S3 objects.
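
As a rough sketch of that "query everything under each prefix" approach from the CLI (the bucket name is hypothetical, and this assumes key prefixes contain no whitespace), the following finds the newest LastModified under each top-level prefix and prints the most recent one:

bucket="some-bucket-name"
for prefix in $(aws s3api list-objects-v2 --bucket "$bucket" --delimiter "/" \
    --query "CommonPrefixes[].Prefix" --output text); do
  newest=$(aws s3api list-objects-v2 --bucket "$bucket" --prefix "$prefix" \
    --query "sort_by(Contents, &LastModified)[-1].LastModified" --output text)
  echo "$newest $prefix"
done | sort | tail -1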

Assisted support: Due to the limits of the above, some software, including the AWS Console, 'invents' a convention of a special kind of 'directory object'. Exactly what this object looks like depends on the software that created it, and it also depends on that software to manage any concept of 'last updated'. I do not know of any software that actually updates the 'directory object' every time new keys are added, but if you are in 100% control of all S3 uploads then you could explicitly update the 'directory object' every time a new object is uploaded. This incurs a significant performance cost, and can only be guaranteed to be accurate if no other methods of uploading files are allowed.
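
As an illustration of that convention (the key layout here is an assumption, not a standard), you could re-put a zero-byte object whose key ends in '/' after every upload, so its LastModified always reflects the newest content:

aws s3 cp data.csv "s3://some-bucket-name/exports/2024-01-01/data.csv"
aws s3api put-object --bucket "some-bucket-name" --key "exports/2024-01-01/"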

To work around these issues, I have seen (and written) implementations that instead use S3 object events and store an external index of S3, kept up to date by processing all the S3 event messages (via Amazon EventBridge). A common way to do this is to use DynamoDB or Memcached/Redis to maintain an up-to-date index of S3 -- that may be significantly more efficient to keep current and to query, and it avoids the problem of 'other software' updating the same S3 bucket, because it does not rely on code running during object creation; rather, it relies on events emitted after object creation, regardless of source.

While this is possible, and can be made quite efficient, it is not easy to get right, as the event system does not guarantee the order of events. Handling things like out-of-order Update/Delete/Update messages correctly can be tricky. There is also an unknown delay between the object upload and when the event processing completes, so there are periods of time (seconds? more?) where the index will be behind. This also (obviously) requires managing another data store and the software that maintains the index: making sure it scales properly, recovers from errors, manages permissions and access to the index, can 'refresh the index' as needed, etc. This can be a significant long-term effort.
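
Once such an index exists, answering the original question becomes a single cheap lookup. Assuming a hypothetical DynamoDB table named s3-prefix-index, keyed by bucket_name, with a GSI named by-last-modified sorted on an ISO 8601 timestamp attribute (all of these names are made up for illustration), the query might look like:

aws dynamodb query \
  --table-name s3-prefix-index \
  --index-name by-last-modified \
  --key-condition-expression "bucket_name = :b" \
  --expression-attribute-values '{":b": {"S": "some-bucket-name"}}' \
  --no-scan-index-forward \
  --limit 1

Here --no-scan-index-forward reads the index in descending order, so the first item returned is the most recently modified prefix.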

DALDEI
answered 2 years ago
