Textract Struggles

I'm still new to Amazon Textract, and I'm trying to get it to read a form and then answer predefined questions through the Queries feature. Here are my specific questions:

  • Without specifying a particular page, how do I ensure Textract scans the whole document? Or, if I have to specify pages, can I default the selection to include all pages?

  • How can I make Textract recognize when a checkbox is marked with an 'X' rather than a check mark?

  • Is there a way to run the same queries automatically on every document uploaded to Textract?

  • How can I ensure that a query automatically searches the entire document content, including forms, tables, etc.?

  • In response to an answer to a query, can Textract or an add-on AWS feature push predetermined text? For example, in response to the question "What is the name of the author?", can Textract or an add-on AWS feature push the text "His name is Jeff"?

asked a year ago, 276 views
1 Answer
Hi, I appreciate this question is from some time ago now, but I hope the following might still be useful:

1/ Scanning the whole document: As documented in the API reference for the Query data type, Amazon Textract only scans the first page (["1"]) for your Queries by default. If you want to scan all pages, set the Pages parameter of each query to ["*"].
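
As a rough sketch of the above, the helper below builds a QueriesConfig in which every query has Pages set to ["*"]. The function names, the alias AUTHOR, and the bucket/key arguments are placeholders of mine, and I'm assuming boto3 and an asynchronous job started with StartDocumentAnalysis:

```python
def build_queries_config(questions, pages=("*",)):
    """Build a Textract QueriesConfig in which every query targets `pages`.

    `questions` maps an alias (our own label) to the question text.
    """
    return {
        "Queries": [
            {"Text": text, "Alias": alias, "Pages": list(pages)}
            for alias, text in questions.items()
        ]
    }


def start_analysis(bucket, key, questions):
    """Start an asynchronous analysis job. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    textract = boto3.client("textract")
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["QUERIES"],
        QueriesConfig=build_queries_config(questions),
    )
    return response["JobId"]


config = build_queries_config({"AUTHOR": "What is the name of the author?"})
print(config)
```

Passing a different `pages` value, for example `("1-3",)`, would restrict the queries to a page range instead.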

2/ Checkboxes marked as 'X': Today, the selection elements feature is not customizable or fine-tunable. It should generally recognise a range of elements as outlined in the docs, from checkboxes to radio buttons to circled or crossed text. Assuming you have something like [X], as often used in Markdown, I'd tentatively expect it to detect that pretty well. But if it's not working for your particular documents, you'd need to explore a post-processing solution such as Amazon Comprehend, rules-based logic, or custom ML models.
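
To check how well your 'X' marks are being detected, one option is to list the SELECTION_ELEMENT blocks in a response together with their status and confidence. This is only a sketch: the helper name and the sample response are made up for illustration, and it assumes the request enabled a feature that returns selection elements (FORMS or TABLES):

```python
def selection_elements(response):
    """Return (status, confidence) for every SELECTION_ELEMENT block."""
    return [
        (block["SelectionStatus"], block.get("Confidence"))
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "SELECTION_ELEMENT"
    ]


# Hand-written stand-in for a real AnalyzeDocument response.
sample_response = {
    "Blocks": [
        {"Id": "1", "BlockType": "WORD", "Text": "Agree"},
        {"Id": "2", "BlockType": "SELECTION_ELEMENT",
         "SelectionStatus": "SELECTED", "Confidence": 98.5},
        {"Id": "3", "BlockType": "SELECTION_ELEMENT",
         "SelectionStatus": "NOT_SELECTED", "Confidence": 91.2},
    ]
}

for status, confidence in selection_elements(sample_response):
    print(status, confidence)
```

Low confidence values or wrong statuses on your own documents would be the signal to add the rules-based or custom post-processing mentioned above.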

3/ Automatically attach queries: Today, you need to specify your input queries on each call to Amazon Textract (e.g. AnalyzeDocument or StartDocumentAnalysis). To run a fixed set of queries, you would handle this on your application side, perhaps managing the configuration in a store like AWS AppConfig, SSM Parameter Store, or DynamoDB rather than hard-coding it in your app.
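
For illustration, here is a minimal sketch of keeping the fixed queries as a JSON document and turning it into a QueriesConfig on each call. The JSON layout (alias/text/pages keys), the function names, and the parameter name passed to Parameter Store are assumptions of mine, not anything Textract prescribes:

```python
import json


def queries_from_json(raw):
    """Turn a JSON list of {"alias", "text", "pages"} entries into a QueriesConfig."""
    return {
        "Queries": [
            {
                "Text": entry["text"],
                "Alias": entry["alias"],
                "Pages": entry.get("pages", ["*"]),  # default to all pages
            }
            for entry in json.loads(raw)
        ]
    }


def load_from_parameter_store(name):
    """Fetch the JSON from SSM Parameter Store. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the parsing above works without it

    ssm = boto3.client("ssm")
    return queries_from_json(ssm.get_parameter(Name=name)["Parameter"]["Value"])


stored = '[{"alias": "AUTHOR", "text": "What is the name of the author?"}]'
print(queries_from_json(stored))
```

The resulting dictionary can be passed as the QueriesConfig argument every time your application calls Textract, so the question list lives in one place.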

4/ Searching the whole document, including tables etc: As mentioned in 1/, you can configure the Pages parameter of your query to scan all pages of your document. Queries should already use the visual/layout information from the page when trying to answer your questions, so there should be no need to explicitly configure it to work together with the FORMS/TABLES features (or to enable those features on the request if you don't need them).
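
Once the job has run, the answers come back as QUERY blocks linked to QUERY_RESULT blocks through an ANSWER relationship. The sketch below collects them per alias; the helper name and sample blocks are hypothetical, and I'm collecting a list per alias in case a query yields more than one answer across pages:

```python
def query_answers(blocks):
    """Map each query alias (or its text) to the list of answer strings found."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        key = block["Query"].get("Alias") or block["Query"]["Text"]
        found = answers.setdefault(key, [])
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "ANSWER":
                found.extend(
                    by_id[i]["Text"] for i in relationship["Ids"] if i in by_id
                )
    return answers


# Hand-written stand-in for the Blocks list of a real response.
sample_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the name of the author?", "Alias": "AUTHOR"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "Jeff", "Confidence": 97.0},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the contract date?", "Alias": "DATE"}},
]

print(query_answers(sample_blocks))
```

A query with no ANSWER relationship, like DATE here, ends up with an empty list, which is an easy way to spot questions Textract could not answer.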

5/ Adding text to query answers: In general, Textract Queries answers questions by extracting content from the source document. If you just want to add fixed text to your detected results, I'd suggest doing this on the application side, and it should be pretty straightforward. If you want more open-ended question answering that transforms the source text (for example "What's the contract date in YYYY-MM-DD format?", or "Summarize the document", or "Is this invoice from before 2020?"), your use case might be better suited to a Generative AI-based technology like Anthropic Claude 3+ on Amazon Bedrock.
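
For the fixed-text case, a minimal application-side sketch could be a per-alias template applied to whatever answer came back. The TEMPLATES table, the AUTHOR alias, and the function name are placeholders for illustration:

```python
# One template per query alias; "{answer}" is replaced by the extracted text.
TEMPLATES = {"AUTHOR": "His name is {answer}"}


def render_answer(alias, answer, templates=TEMPLATES):
    """Wrap an extracted answer in its predetermined text, if a template exists."""
    if answer is None:
        return None  # nothing was extracted, so there is nothing to wrap
    return templates.get(alias, "{answer}").format(answer=answer)


print(render_answer("AUTHOR", "Jeff"))  # prints "His name is Jeff"
```

Aliases without a template fall back to the raw answer, so new queries can be added without touching this code.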

AWS EXPERT, answered 6 months ago
