Best practices for finding specific words in documents


Hi, one of my team's use cases with Textract is to categorize documents by the presence of a certain word. We do this with a query: "Is <word> in the document?". We tested this extensively on several test documents, and it never returned an incorrect response.

However, in production, we started noticing that Textract returned "Yes" for many documents even though the specific word was nowhere to be found in them. The confidence scores for these false positives go as high as 65%, so setting a threshold on confidence could reject true positives as well.

I wanted to know if there are any best practices around this type of query and how I can improve its accuracy. I would also appreciate any insight into how and where Textract finds values from the document, like whether it looks into metadata or employs any sort of edit distance matching to find words.

asked a year ago, 315 views
1 Answer

Hello,

Amazon Textract Queries uses a large language model under the hood, so it may also pick up synonyms of the keyword you are searching for. If you mostly need exact word matches, I'd recommend trying the following techniques against the raw text extracted by Textract:

  • Pattern matching
  • Levenshtein distance matching
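
As a rough illustration of both techniques, here is a minimal Python sketch. It assumes you already have a Textract response dict (for example from text detection) whose "Blocks" entries carry "BlockType" and "Text" fields. The helper names (lines_to_text, exact_match, fuzzy_match) and the max_distance threshold are hypothetical choices for this example, not part of any Textract API.

```python
import re


def lines_to_text(response):
    """Join the text of every LINE block in a Textract response dict."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)


def exact_match(text, keyword):
    """Pattern matching: whole-word, case-insensitive search for keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_match(text, keyword, max_distance=1):
    """True if any token in text is within max_distance edits of keyword.

    max_distance is an assumed tolerance for OCR noise; tune it per use case.
    """
    target = keyword.lower()
    tokens = re.findall(r"\w+", text.lower())
    return any(levenshtein(token, target) <= max_distance for token in tokens)
```

With this approach the decision is made on the extracted text itself, so a synonym never counts as a hit, and the edit-distance tolerance is only there to absorb OCR misreads such as "lnvoice" for "invoice". Keeping max_distance small (0 or 1) for short keywords helps avoid matching unrelated words.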
AWS
NZ
answered a year ago
  • This is definitely a great option to consider, but it'll require some redesign of our generic service as it is set up to work only with queries right now. I was hoping for some way to improve our use of queries itself before focusing on bigger enhancements.
