Hi,
One of my team's use cases for Textract is categorizing documents by whether a certain word appears in them. We do this with a query: "Is <word> in the document?". We tested this extensively on several test documents and it never returned an incorrect response.
However, in production, we started noticing that Textract returned "Yes" for many documents even though the specific word was nowhere to be found in them. The confidence scores for these false positives go as high as 65%, so a confidence threshold strict enough to filter them out could also reject true positives.
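For anyone following along, the threshold check mentioned above could be sketched roughly like this. It is only an illustration, not our production code: it assumes a `response` dict shaped like the output of boto3's `analyze_document` called with the `QUERIES` feature type, and the helper names `query_answer` and `word_present` are made up for this post.

```python
from typing import Optional, Tuple


def query_answer(response: dict, query_text: str) -> Optional[Tuple[str, float]]:
    """Return (answer_text, confidence) for a query, or None if it has no answer.

    Assumes `response` follows the Textract AnalyzeDocument shape: a QUERY block
    points to its QUERY_RESULT block through an ANSWER relationship.
    """
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        if block.get("Query", {}).get("Text") != query_text:
            continue
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                answer = blocks.get(answer_id)
                if answer and answer.get("BlockType") == "QUERY_RESULT":
                    return answer.get("Text", ""), answer.get("Confidence", 0.0)
    return None


def word_present(response: dict, query_text: str, min_confidence: float) -> bool:
    """Treat the word as present only for a "Yes" at or above the threshold."""
    result = query_answer(response, query_text)
    if result is None:
        return False
    text, confidence = result
    return text.strip().lower() == "yes" and confidence >= min_confidence
```

With a false positive at 65% confidence, `min_confidence` has to sit above 65 to reject it, and any genuine "Yes" scoring below that cutoff is rejected too, which is exactly the trade-off we are stuck with.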
I wanted to know if there are any best practices for this type of query and how I can improve its accuracy. I would also appreciate any insight into how and where Textract finds values in the document, such as whether it looks at metadata or uses any sort of edit-distance matching to find words.
This is definitely a great option to consider, but it'll require some redesign of our generic service as it is set up to work only with queries right now. I was hoping for some way to improve our use of queries itself before focusing on bigger enhancements.