Best practices for finding specific words in documents


Hi, one of my team's use cases for Textract is categorizing documents by whether a certain word appears in them. We do this with a query: "Is <word> in the document?". We tested this extensively on several test documents and it never returned an incorrect response.

However, in production, we started noticing that Textract returned "Yes" for many documents even though the specific word was nowhere to be found in them. The confidence scores for these false positives go as high as 65%, so a confidence threshold strict enough to filter them out could also reject true positives.

I wanted to know if there are any best practices around this type of query and how I can improve its accuracy. I would also appreciate any insight into how and where Textract finds values from the document, like whether it looks into metadata or employs any sort of edit distance matching to find words.

Asked 1 year ago, 329 views
1 Answer

Hello,

Amazon Textract Queries uses a large language model under the hood, so it may also pick up synonyms of the keyword you are searching for. If you are mostly looking for exact word matches, I'd recommend trying the following techniques against the raw text extracted by Textract:

  • Pattern matching
  • Levenshtein distance matching
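
As a rough illustration of the two techniques above, here is a minimal Python sketch. It assumes you already have the `Blocks` list from a Textract response, where `WORD` blocks carry a `Text` field. The helper names (`words_from_blocks`, `levenshtein`, `contains_word`) and the `max_distance` parameter are made up for this example and are not part of any AWS SDK. It handles single-word targets only.

```python
import re


def words_from_blocks(blocks):
    """Collect the text of WORD blocks from a Textract-style Blocks list."""
    return [b["Text"] for b in blocks
            if b.get("BlockType") == "WORD" and "Text" in b]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def contains_word(blocks, target, max_distance=0):
    """Return True if `target` appears as a whole word (case-insensitive).

    With max_distance > 0, also accept words within that edit distance,
    which tolerates small OCR errors at the cost of more false positives.
    """
    target_norm = target.casefold()
    pattern = re.compile(r"\b" + re.escape(target_norm) + r"\b")
    for word in words_from_blocks(blocks):
        token = word.casefold()
        # Pattern matching: exact whole-word match, ignoring punctuation.
        if pattern.search(token):
            return True
        # Fuzzy matching: strip surrounding punctuation, then compare.
        stripped = re.sub(r"^\W+|\W+$", "", token)
        if max_distance > 0 and levenshtein(stripped, target_norm) <= max_distance:
            return True
    return False


# Hypothetical sample data shaped like a Textract response's Blocks list.
sample_blocks = [
    {"BlockType": "LINE", "Text": "Tax Invoice,"},
    {"BlockType": "WORD", "Text": "Tax"},
    {"BlockType": "WORD", "Text": "Invoice,"},
]
found = contains_word(sample_blocks, "invoice")
```

Keeping `max_distance` at 0 gives strict matching. Raising it to 1 or 2 trades precision for tolerance of OCR misreads, so it is worth tuning against documents you have already labelled.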
NZ (AWS), answered 1 year ago
This is definitely a great option to consider, but it will require some redesign of our generic service, which is currently set up to work only with queries. I was hoping for some way to improve our use of queries themselves before focusing on bigger enhancements.
