Edit (2022-07-18)
To debug, I have shared more of my code and set up a public S3 bucket.
In the process, it started working as expected after a few learnings and tweaks. The current state is pasted below for context.
Code:
import json
import boto3
s3 = boto3.client("s3")
s3_paginator = s3.get_paginator("list_objects_v2")
s3_iterator = s3_paginator.paginate(Bucket="jmespath-not-matching-as-expected")
# [Output 1]: All keys (with content, skips empty files and folders)
for key_data in s3_iterator:
    print("[Output 1]")
    print(
        json.dumps(
            {
                k: [{"Key": i["Key"]} for i in v if i["Size"] != 0]
                for k, v in key_data.items()
                if k == "Contents"
            }
        )
    )
# Query which only returns a subset of interest
query_starts_with = f" && starts_with(Key,`sub_folder`)"
# Note: Previously successfully processed folders have had an empty file 'processed' added to paths
# Trying to move/archive is too slow, easier to have a query skip over what has already been processed
processed = [
    i.split("/processed")[0]
    for i in s3_iterator.search("Contents[?contains(Key,'processed')].Key")
]
query_unprocessed = (
    f" && ({' && '.join([f'!contains(Key, `{p}`)' for p in processed])})"
    if processed
    else ""
)
# [Output 2]: Subset of keys (only those with content and not contained in a subfolder already processed)
for key_data in s3_iterator.search(
    f"Contents[?Size!=`0`{query_starts_with}{query_unprocessed}].Key"
):
    print("[Query]")
    print(f"Contents[?Size!=`0`{query_starts_with}{query_unprocessed}].Key")
    print("[Output 2]")
    print(key_data)
Query:
Contents[?Size!=`0` && starts_with(Key,`sub_folder`) && (!contains(Key, `sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31`) && !contains(Key, `sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9`))].Key
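For anyone who wants to check the string construction without touching S3, here is a minimal sketch of how a query like the one above could be assembled from a list of processed prefixes. The helper name `build_query` is my own (hypothetical, not part of boto3 or jmespath); it only builds a string and reproduces the backtick literal style used in this post.

```python
def build_query(prefix, processed):
    """Assemble a JMESPath filter like the one shown in this post.

    `build_query` is a hypothetical helper for illustration only.
    """
    # Always skip zero-byte objects and restrict to the given prefix.
    conditions = ["Size!=`0`", f"starts_with(Key,`{prefix}`)"]
    if processed:
        # One !contains(...) clause per processed path, joined with &&.
        exclusions = " && ".join(f"!contains(Key, `{p}`)" for p in processed)
        conditions.append(f"({exclusions})")
    joined = " && ".join(conditions)
    return f"Contents[?{joined}].Key"


processed_paths = [
    "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31",
    "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9",
]
print(build_query("sub_folder", processed_paths))
```

With the two processed paths above, this prints the same query string shown under "Query:". With an empty list, the exclusion group is left out entirely, so the filter does not end in a dangling `&& ()`.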
Outputs:
[Output 1]
{
    "Contents": [
        {
            "Key": "sub_folder/should_display/random"
        },
        {
            "Key": "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31/random"
        },
        {
            "Key": "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9/random"
        },
        {
            "Key": "this/should_not_display/random"
        }
    ]
}
[Output 2]
sub_folder/should_display/random
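As a cross-check on Output 2, the same filter can be written in plain Python and run over page dictionaries. This is only a sketch under assumptions: `unprocessed_keys` is a hypothetical helper, the sample page mirrors the keys from Output 1, and the `Size` values are made up (Output 1 only lists non-empty objects). It is handy for debugging because each condition can be inspected separately.

```python
def unprocessed_keys(pages, prefix, processed):
    """Yield keys of non-empty objects under `prefix`, skipping processed paths.

    Hypothetical helper: a plain-Python equivalent of the JMESPath filter.
    """
    for page in pages:
        # A page may have no "Contents" entry (e.g. nothing listed), so default to [].
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if obj["Size"] == 0:
                continue  # empty files and folder placeholders
            if not key.startswith(prefix):
                continue  # outside the prefix of interest
            if any(p in key for p in processed):
                continue  # under a path already marked as processed
            yield key


# Sample page modelled on Output 1; Size values are assumed for illustration.
pages = [
    {
        "Contents": [
            {"Key": "sub_folder/should_display/random", "Size": 1},
            {"Key": "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31/random", "Size": 1},
            {"Key": "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9/random", "Size": 1},
            {"Key": "this/should_not_display/random", "Size": 1},
        ]
    }
]
processed = [
    "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31",
    "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9",
]
for key in unprocessed_keys(pages, "sub_folder", processed):
    print(key)
```

Against a real bucket, `pages` could be the `s3_iterator` itself, since iterating the paginator result yields one response dictionary per page.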
Original Question:
Using boto3 (s3 client) I have tried the following:
query = "Contents[?starts_with(Key,`sub_folder`) && (!contains(Key, `sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31`) && !contains(Key, `sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9`))].Key"
s3_paginator = s3.get_paginator("list_objects_v2")
s3_iterator = s3_paginator.paginate(Bucket=S3_BUCKET)
for key_data in s3_iterator.search(query):
    print(key_data)
# Should only print s3 objects from sub_folder that do not contain:
# - sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31 or
# - sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9 in their path
Illustration of s3 folder structure:
{
    "Contents": [
        {"Key": "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31"},
        {"Key": "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9"},
        {"Key": "this/should_not_display"},
        {"Key": "sub_folder/should_display"}
    ]
}
When I paste the above query and JSON illustration into https://jmespath.org/ it works as intended. And yet, with the boto3 s3 client I am not getting any matches where I should.
Your query syntax works fine for me. Does s3_iterator have the expected contents?
Despite things working as expected in this toy example, it is still not working in my real-world scenario. I can't locate what the problem is. For now I am just going to reprocess everything available weekly but expire s3 contents after 6 days.