Skip to content

Handling Nested JSON Field Searches in OpenSearch Serverless with Null Checks

0

Hi everyone,

I'm working on a project using Amazon OpenSearch Serverless to store and search content based on tags. Both tags to filter and tags of content can be multiple. Documents, or contents, may not contain any tag field, too. I need to filter and search content based on the similarity of tags between tags to filte and the content. Similarity is decided in this way: if a tag have the same group_id, then it should have the same id and src_code.

Example Structure:

Here is an example of the tags field in the content documents:

"_source": {
          "tags": [
            {
              "group_id": 5,
              "id": 1,
              "src_code": "2324"
            },
            {
              "group_id": 6,
              "id": 2,
              "src_code": "91"
            },
            {
              "group_id": 7,
              "id": 3,
              "src_code": "3124"
            },
            {
              "group_id": 8,
              "id": 4,
              "src_code": "5543"
            }
          ]
        }

Tags to filter:

{
  "filter_tags": [
    {"id": 123, "group_id": 1, "src_code": "2"},
    {"id": 124, "group_id": 2, "src_code": "23"},
    {"id": 1, "group_id": 5, "src_code": "2324"},
    {"id": 4, "group_id": 6, "src_code": "2325"},
  ]
}

The goal is to return documents where all filter_tags match the corresponding content tags based on group_id, id, and src_code. filter_tags may contains tags whose group_id is different than content, in that case I don't want to compare them.

With the given examples, I want to compare only 3rd and 4th filter_tags, since they content tags have the same group_id: 1st filter_tag -> don't compare since no matching group_id 2nd filter_tag -> don't compare since no matching group_id 3rd filter_tag -> compare since group_id matches -> score is 1 since id and src_code are equal. 4th filter_tag -> compare since group_id matches -> score is 0 since id and src_code are not equal.

Overall don't match the doc, since there is a tag with same group_id but different id and src_code.

What I have tried:

I tried the following two methods

  1. Without tag field exists check
POST /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "nested": {
            "path": "tags",
            "query": {
              "function_score": {
                "query": {
                  "match_all": {}
                },
                "functions": [
                  {
                    "script_score": {
                      "script": {
                        "source": """
                          double score = 0;
                          for (def filter_tag : params.filter_tags) {
                            boolean match = false;
                            for (def content_tag : doc['tags']) {
                              if (content_tag['group_id'].value == filter_tag.group_id &&
                                  content_tag['id'].value == filter_tag.id &&
                                  content_tag['src_code'].value == filter_tag.src_code) {
                                match = true;
                                break;
                              }
                            }
                            if (match) {
                              score += 1;
                            } else {
                              score = 0;
                              break;
                            }
                          }
                          return score;
                        """,
                        "params": {
                          "filter_tags": [
                            {"id": 1, "group_id": 5, "src_code": "2324"},
                            {"id": 2, "group_id": 6, "src_code": "91"}
                          ]
                        }
                      }
                    }
                  }
                ],
                "boost_mode": "replace"
              }
            }
          }
        }
      ]
    }
  },
  "_source": ["tags"]
}

The error I got:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.opensearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:82)",
          "org.opensearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:50)",
          "for (def content_tag : doc['tags']) {\n                              ",
          "                           ^---- HERE"
        ],
        "script": " ...",
        "lang": "painless",
        "position": {
          "offset": 221,
          "start": 194,
          "end": 262
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": []
  },
  "status": 400
}

I thought this is because some documents don't have tags field and added a condition to check the existence of the field:

POST /bedrock-knowledge-base-default-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "nested": {
            "path": "tags",
            "query": {
              "function_score": {
                "query": {
                  "match_all": {}
                },
                "functions": [
                  {
                    "script_score": {
                      "script": {
                        "source": """

                         # THIS IS THE ADDED CONDITION

                         if ((! doc.containsKey('tags') )|| doc['_source']['tags'].empty) {
                            return 0.8;
                          }



                          double score = 0;
                          for (def filter_tag : params.filter_tags) {
                            boolean match = false;
                            for (def content_tag : doc['tags']) {
                              if (content_tag['group_id'].value == filter_tag.group_id &&
                                  content_tag['id'].value == filter_tag.id &&
                                  content_tag['src_code'].value == filter_tag.src_code) {
                                match = true;
                                break;
                              }
                            }
                            if (match) {
                              score += 1;
                            } else {
                              score = 0;
                              break;
                            }
                          }
                          return score;
                        """,
                        "params": {
                          "filter_tags": [
                            {"id": 1, "group_id": 5, "src_code": "2324"},
                            {"id": 2, "group_id": 6, "src_code": "91"}
                          ]
                        }
                      }
                    }
                  }
                ],
                "boost_mode": "replace"
              }
            }
          }
        }
      ]
    }
  },
  "_source": ["tags"]
}

But this time all of the documents, even those with tags field are returned with 0.8 score.

Asking for help for these cases

  1. Correct Null Check: How can I correctly check for the presence and non-emptiness of the tags field in the script?
  2. Correct Filtering: How can I filter tags based on my need: equal group id should have the equal id and src_code
  3. Suggestions for Document Structure: If there isn't any way or easy way to achieve the goal, are there any recommended changes to the structure of my documents to make such queries more straightforward.
1 Answer
0

As per your questions, I will go one by one in replying to them:

  • Correct Null Check: In OpenSearch, you can check if a field exists and is not empty using the exists query. However, in a script, you can use the doc.containsKey('tags') to check if the field exists. If the field exists but is empty, you can check it with doc['tags'].empty. Your condition seems correct, but the issue might be with the scoring. When the tags field doesn’t exist or is empty, you’re returning a score of 0.8 which might be causing all documents to have a score of 0.8.
  • Correct Filtering: Your current script seems to be on the right track for filtering based on your needs. You’re correctly iterating over the filter tags and the document tags and comparing the group_id, id, and src_code. If all these fields match, you’re incrementing the score, and if they don’t, you’re setting the score to 0 and breaking the loop. This should work as expected, given that the tags field exists and is not empty.
  • Suggestions for Document Structure: Your current document structure seems fine for your needs. However, if you find that this structure is causing performance issues or if the complexity of the queries is too high, you might consider flattening the tags into separate fields (e.g., tag_group_id, tag_id, tag_src_code) or using a different data structure that better suits your needs.
EXPERT
answered a year ago
    1. doc.containsKey('tags') this check always returns FALSE, even though there are 'tags' key inside source. I am giving the score higher than zero when tags key doesn't exist because I want to include documents that doesn't have any tags. I want to apply the filtering condition to documents with tags. But I can't get those documents with tags key since it is not passing the first condition: if ((! doc.containsKey('tags') )|| doc['_source']['tags'].empty)

    2. As you have menation, I tried to write the query by looking at documentations BUT they are giving the error in the post, even when ALL the documents have valid tags field.

    3. I thought about flattening the tags but since there can be many tags and the values will be updated. But I couldn't come up with an algorithm to create keys that will stay the same for both content tags and and filter, despite the order and time of adding tags.

    I would really appreciate if you could provide more soltuion if possible, especially for 1 and 2. I couldn't find any syntax error but reaching the key by using doc['key'] syntax is producing error for all keys.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.