s3 jmespath not matching as expected


Edit (2022-07-18): To debug, I have shared more of my code and set up a public S3 bucket.

In the process, after a few tweaks, it now seems to be working as expected. The current state is pasted below for context.

Code:

import json

import boto3


s3 = boto3.client("s3")
s3_paginator = s3.get_paginator("list_objects_v2")
s3_iterator = s3_paginator.paginate(Bucket="jmespath-not-matching-as-expected")

# [Output 1]: All keys (with content, skips empty files and folders)
for key_data in s3_iterator:
    print("[Output 1]")
    print(
        json.dumps(
            {
                k: [{"Key": i["Key"]} for i in v if i["Size"] != 0]
                for k, v in key_data.items()
                if k == "Contents"
            }
        )
    )

# Query which only returns a subset of interest
query_starts_with = " && starts_with(Key,`sub_folder`)"
# Note: Previously successfully processed folders have had an empty file 'processed' added to paths
# Trying to move/archive is too slow, easier to have a query skip over what has already been processed
processed = [
    i.split("/processed")[0]
    for i in s3_iterator.search("Contents[?contains(Key,'processed')].Key")
]
query_unprocessed = (
    f" && ({' && '.join([f'!contains(Key, `{p}`)' for p in processed])})"
    if processed
    else ""
)
# [Output 2]: Subset of keys (only those with content and not contained in a subfolder already processed)
for key_data in s3_iterator.search(
    f"Contents[?Size!=`0`{query_starts_with}{query_unprocessed}].Key"
):
    print("[Query]")
    print(f"Contents[?Size!=`0`{query_starts_with}{query_unprocessed}].Key")

    print("[Output 2]")
    print(key_data)

Query:

Contents[?Size!=`0` && starts_with(Key,`sub_folder`) && (!contains(Key, `sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31`) && !contains(Key, `sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9`))].Key
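
A minimal sketch of how the same expression could be assembled by a small helper instead of nested f-strings, which makes it easier to print and inspect the query before it is passed to s3_iterator.search(). The names quote_raw and build_query are hypothetical and not part of boto3 or jmespath. The sketch also assumes JMESPath raw string literals ('...') in place of the backtick literals used above; both forms should select the same keys here.

```python
from typing import Iterable


def quote_raw(value: str) -> str:
    """Wrap a value as a JMESPath raw string literal, escaping single quotes."""
    return "'" + value.replace("'", "\\'") + "'"


def build_query(prefix: str, processed: Iterable[str]) -> str:
    """Build a filter for non-empty keys under prefix, skipping processed paths."""
    conditions = ["Size != `0`", f"starts_with(Key, {quote_raw(prefix)})"]
    # One negated contains() per already-processed path (hypothetical input list).
    conditions += [f"!contains(Key, {quote_raw(p)})" for p in processed]
    return f"Contents[?{' && '.join(conditions)}].Key"


query = build_query(
    "sub_folder",
    [
        "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31",
        "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9",
    ],
)
print(query)
```

The resulting string can be passed to s3_iterator.search(query) in the same way as the hand-built one. When the processed list is empty, the helper simply omits the contains() conditions.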

Outputs:

[Output 1]
{
  "Contents": [
    {
      "Key": "sub_folder/should_display/random"
    },
    {
      "Key": "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31/random"
    },
    {
      "Key": "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9/random"
    },
    {
      "Key": "this/should_not_display/random"
    }
  ]
}

[Output 2]
sub_folder/should_display/random

Original Question:

Using boto3 (s3 client) I have tried the following:

query = "Contents[?starts_with(Key,`sub_folder`) && (!contains(Key, `sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31`) && !contains(Key, `sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9`))].Key"

s3_paginator = s3.get_paginator("list_objects_v2")
s3_iterator = s3_paginator.paginate(Bucket=S3_BUCKET)

for key_data in s3_iterator.search(query):
    print(key_data)
    # Should only print S3 objects from sub_folder whose path does not contain:
    #  - sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31 or
    #  - sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9

Illustration of s3 folder structure:

{
  "Contents": [
    {"Key": "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31"},
    {"Key": "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9"},
    {"Key": "this/should_not_display"},
    {"Key": "sub_folder/should_display"}
  ]
}

When I paste the above query and JSON illustration into https://jmespath.org/ it works as intended. Yet with the boto3 S3 client I am not getting any matches where I should.

  • Your query syntax works fine for me. Does s3_iterator have the expected contents?

  • Despite things working as expected in this toy example, it is still not working in my real world scenario. Can't locate what the problem is. For now I am just going to reprocess everything available weekly but expire s3 contents after 6 days.

asked 3 years ago, 231 views
1 Answer

To resolve the issue with filtering S3 objects using JMESPath in boto3, iterate through each page of results returned by paginate() and apply your JMESPath query to each page. Because the query itself starts at Contents, pass the whole page to jmespath.search() rather than page['Contents'].

Example Code:

import boto3
import jmespath

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
# The query starts at the top-level "Contents" key of each page.
query = "Contents[?Size!=`0` && starts_with(Key, `sub_folder`) && (!contains(Key, `sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31`) && !contains(Key, `sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9`))].Key"

for page in paginator.paginate(Bucket="your_bucket_name"):
    # Search the whole page; a page without "Contents" yields None.
    matched_keys = jmespath.search(query, page) or []
    for key in matched_keys:
        print(key)

Also, verify that the JSON structure matches what your JMESPath query expects (Contents as a top-level key containing Key values).
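
One way to check whether a mismatch comes from the query or from the data is to apply the same conditions in plain Python and compare the result with what the JMESPath query returns for the same page. The following is only a sketch: expected_keys is a hypothetical helper, and the sample page (including its Size values and the empty "sub_folder/" marker) is made up to mirror the listing in the question.

```python
def expected_keys(page, prefix, excluded):
    """Plain-Python equivalent of the JMESPath filter, for cross-checking."""
    keys = []
    # "Contents" is missing from a page when no objects are listed.
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if obj.get("Size") == 0:
            continue  # skip empty files and folder markers
        if not key.startswith(prefix):
            continue
        if any(path in key for path in excluded):
            continue  # key sits under an already-processed path
        keys.append(key)
    return keys


# Hypothetical page shaped like a list_objects_v2 response.
page = {
    "Contents": [
        {"Key": "sub_folder/", "Size": 0},
        {"Key": "sub_folder/should_display/random", "Size": 12},
        {"Key": "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31/random", "Size": 12},
        {"Key": "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9/random", "Size": 12},
        {"Key": "this/should_not_display/random", "Size": 12},
    ]
}
result = expected_keys(
    page,
    "sub_folder",
    [
        "sub_folder/should_not_display/38286a88-eef4-4f04-827c-f7809bc04f31",
        "sub_folder/should_not_display/c86fae3b-9987-42a7-b876-d995ce66a9f9",
    ],
)
print(result)
```

If this helper and the JMESPath query disagree on a real page, the keys that differ are a good place to start looking, for example for unexpected prefixes or for paths that were marked as processed by accident.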

answered a year ago