Best practices for finding specific words in documents


Hi, one of my team's use cases with Textract is categorizing documents based on the presence of a certain word. We do this with a query: "Is <word> in the document?". We tested this extensively with several test documents and it never returned an incorrect response.

However, in production, we started noticing that Textract returned "Yes" for many documents even though the specific word was nowhere to be found in them. The confidence scores for these false positives go up to 65%, so setting a threshold on that can potentially reject true positives as well.

I wanted to know if there are any best practices around this type of query and how I can improve its accuracy. I would also appreciate any insight into how and where Textract finds values from the document, like whether it looks into metadata or employs any sort of edit distance matching to find words.

Asked a year ago, 327 views
1 answer

Hello,

Amazon Textract Queries uses a large language model under the hood, so it may also pick up synonyms of the keyword you are searching for. If you mostly need exact word matches, I'd recommend trying the following techniques against the raw text extracted by Textract:

  • Pattern matching
  • Levenshtein distance matching
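
As a rough illustration of both techniques, here is a minimal Python sketch. It assumes you already have a Textract response dict (for example from DetectDocumentText) whose `Blocks` entries carry `BlockType` and `Text`. The helper names (`words_from_textract`, `has_exact_word`, `contains_word`) and the `max_distance` parameter are hypothetical, chosen only for this example, and the right distance threshold would need tuning on your own documents.

```python
import re


def words_from_textract(response):
    """Collect the text of WORD blocks from a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "WORD" and "Text" in block
    ]


def has_exact_word(text, target):
    """Pattern matching: case-insensitive whole-word search using word boundaries."""
    pattern = r"\b" + re.escape(target) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalize(token):
    """Lowercase a token and strip leading/trailing punctuation."""
    return re.sub(r"^\W+|\W+$", "", token).lower()


def contains_word(words, target, max_distance=0):
    """True if any extracted word is within max_distance edits of target.

    max_distance=0 means an exact (normalized) match; a small positive
    value tolerates OCR slips such as '1' read in place of 'l'.
    """
    target = target.lower()
    return any(levenshtein(normalize(w), target) <= max_distance for w in words)
```

A call such as `contains_word(words_from_textract(response), "invoice", max_distance=1)` would then serve as the categorization check. Keeping `max_distance` small relative to the word length helps avoid matching unrelated short words.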
AWS
NZ
Answered a year ago
  • This is definitely a great option to consider, but it'll require some redesign of our generic service as it is set up to work only with queries right now. I was hoping for some way to improve our use of queries itself before focusing on bigger enhancements.
