Hi everyone,
I'm working on a project using Amazon OpenSearch Serverless to store and search content based on tags. Both tags to filter and tags of content can be multiple. Documents, or contents, may not contain any tag field, too. I need to filter and search content based on the similarity of tags between tags to filte and the content. Similarity is decided in this way: if a tag have the same group_id, then it should have the same id and src_code.
Example Structure:
Here is an example of the tags field in the content documents:
"_source": {
"tags": [
{
"group_id": 5,
"id": 1,
"src_code": "2324"
},
{
"group_id": 6,
"id": 2,
"src_code": "91"
},
{
"group_id": 7,
"id": 3,
"src_code": "3124"
},
{
"group_id": 8,
"id": 4,
"src_code": "5543"
}
]
}
Tags to filter:
{
"filter_tags": [
{"id": 123, "group_id": 1, "src_code": "2"},
{"id": 124, "group_id": 2, "src_code": "23"},
{"id": 1, "group_id": 5, "src_code": "2324"},
{"id": 4, "group_id": 6, "src_code": "2325"},
]
}
The goal is to return documents where all filter_tags match the corresponding content tags based on group_id, id, and src_code. filter_tags may contains tags whose group_id is different than content, in that case I don't want to compare them.
With the given examples, I want to compare only 3rd and 4th filter_tags, since they content tags have the same group_id:
1st filter_tag -> don't compare since no matching group_id
2nd filter_tag -> don't compare since no matching group_id
3rd filter_tag -> compare since group_id matches -> score is 1 since id and src_code are equal.
4th filter_tag -> compare since group_id matches -> score is 0 since id and src_code are not equal.
Overall don't match the doc, since there is a tag with same group_id but different id and src_code.
What I have tried:
I tried the following two methods
- Without tag field exists check
POST /bedrock-knowledge-base-default-index/_search
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "tags",
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"script_score": {
"script": {
"source": """
double score = 0;
for (def filter_tag : params.filter_tags) {
boolean match = false;
for (def content_tag : doc['tags']) {
if (content_tag['group_id'].value == filter_tag.group_id &&
content_tag['id'].value == filter_tag.id &&
content_tag['src_code'].value == filter_tag.src_code) {
match = true;
break;
}
}
if (match) {
score += 1;
} else {
score = 0;
break;
}
}
return score;
""",
"params": {
"filter_tags": [
{"id": 1, "group_id": 5, "src_code": "2324"},
{"id": 2, "group_id": 6, "src_code": "91"}
]
}
}
}
}
],
"boost_mode": "replace"
}
}
}
}
]
}
},
"_source": ["tags"]
}
The error I got:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.opensearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:82)",
"org.opensearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:50)",
"for (def content_tag : doc['tags']) {\n ",
" ^---- HERE"
],
"script": " ...",
"lang": "painless",
"position": {
"offset": 221,
"start": 194,
"end": 262
}
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": []
},
"status": 400
}
I thought this is because some documents don't have tags field and added a condition to check the existence of the field:
POST /bedrock-knowledge-base-default-index/_search
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "tags",
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"script_score": {
"script": {
"source": """
# THIS IS THE ADDED CONDITION
if ((! doc.containsKey('tags') )|| doc['_source']['tags'].empty) {
return 0.8;
}
double score = 0;
for (def filter_tag : params.filter_tags) {
boolean match = false;
for (def content_tag : doc['tags']) {
if (content_tag['group_id'].value == filter_tag.group_id &&
content_tag['id'].value == filter_tag.id &&
content_tag['src_code'].value == filter_tag.src_code) {
match = true;
break;
}
}
if (match) {
score += 1;
} else {
score = 0;
break;
}
}
return score;
""",
"params": {
"filter_tags": [
{"id": 1, "group_id": 5, "src_code": "2324"},
{"id": 2, "group_id": 6, "src_code": "91"}
]
}
}
}
}
],
"boost_mode": "replace"
}
}
}
}
]
}
},
"_source": ["tags"]
}
But this time all of the documents, even those with tags field are returned with 0.8 score.
Asking for help for these cases
- Correct Null Check: How can I correctly check for the presence and non-emptiness of the tags field in the script?
- Correct Filtering: How can I filter tags based on my need: equal group id should have the equal id and src_code
- Suggestions for Document Structure: If there isn't any way or easy way to achieve the goal, are there any recommended changes to the structure of my documents to make such queries more straightforward.
doc.containsKey('tags') this check always returns FALSE, even though there are 'tags' key inside source. I am giving the score higher than zero when tags key doesn't exist because I want to include documents that doesn't have any tags. I want to apply the filtering condition to documents with tags. But I can't get those documents with tags key since it is not passing the first condition: if ((! doc.containsKey('tags') )|| doc['_source']['tags'].empty)
As you have menation, I tried to write the query by looking at documentations BUT they are giving the error in the post, even when ALL the documents have valid tags field.
I thought about flattening the tags but since there can be many tags and the values will be updated. But I couldn't come up with an algorithm to create keys that will stay the same for both content tags and and filter, despite the order and time of adding tags.
I would really appreciate if you could provide more soltuion if possible, especially for 1 and 2. I couldn't find any syntax error but reaching the key by using doc['key'] syntax is producing error for all keys.