Textract Document Analysis API. Train custom adapter or use baseline model? For >30 Custom queries.

  1. Assume we have 50-250 data points that need to be extracted from PDF files. Each PDF file may be 4-15 pages.
  2. The format and layout of each PDF file may differ. A data point we're searching for may be on page 1 of one PDF or on page 4 of another (and may not exist at all in some).
  3. Not every PDF will have all 250 data points. Some may have only 50; others may have 75-100. So it's not efficient to run all 250 searches (queries) when there may be far fewer data points in the PDF being processed at runtime.
  4. Tried the baseline Document Analysis API: 60-70% accuracy in some instances and up to 95% in others. Reliability is a concern.
  5. Parsing the entire JSON output from the API is cumbersome, as there is a lot of noise (useless text/information) apart from these data points.
  6. Tried custom adapters, but the 30-query limitation is not enough for us to train a model to extract all relevant info.

Questions:

  7. Can there be >30 custom queries in custom adapters? If yes, how? We're not sure whether the restriction depends on the free/paid plan.
  8. How is the cost calculated for the above scenario? If we run 250 searches (queries) on a 15-page document, is that 250 (queries) * 15 (pages) * $0.015 ($15 per 1,000 pages) = $56.25?

*Note: In comparison, an average human takes about 4.5-8 minutes to extract this information and enter it into a spreadsheet (XLS).

  9. How do we optimize query search for the scenario in #3 above? Not every document will have 250 data points, so we don't need to run 250 queries each time.

Thanks for any suggestions and feedback.

asked 2 months ago, 91 views
2 Answers

I'd suggest checking out the new Bedrock Data Automation (BDA), launched in preview just yesterday at re:Invent. With BDA you can define "blueprints" describing the structure to extract from documents, and the service will use generative AI models to pull out the relevant fields.

I'm not sure what BDA's limit on the number of fields is (250 is quite a lot), but you should be able to create multiple "blueprints" to extract different fields in parallel if needed.

AWS
EXPERT
answered 2 months ago

Thank you for your detailed question about Amazon Textract's Document Analysis API and custom adapters. I'll address your questions and concerns point by point:

1-6. I understand your scenario involves extracting 50-250 data points from PDF files of varying lengths and layouts, with inconsistent data point presence across documents. You've experienced mixed results with the baseline Document Analysis API and found the JSON output cumbersome to parse.
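On the parsing concern, a minimal sketch of filtering the response down to query answers only. It assumes the response is the dictionary returned by the AnalyzeDocument operation with the QUERIES feature, where `QUERY` blocks point to `QUERY_RESULT` blocks through an `ANSWER` relationship. The function name and the `min_confidence` parameter are illustrative, not part of any AWS SDK.

```python
def extract_query_answers(response, min_confidence=0.0):
    """Return {alias_or_query_text: (answer_text, confidence)}.

    Everything that is not a QUERY / QUERY_RESULT block (lines, words,
    tables, and so on) is ignored, which removes most of the noise.
    """
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        key = query.get("Alias") or query.get("Text")
        best = None
        # A query with no answer has no ANSWER relationship at all.
        for rel in block.get("Relationships") or []:
            if rel.get("Type") != "ANSWER":
                continue
            for result_id in rel.get("Ids", []):
                result = by_id.get(result_id)
                if result is None or result.get("BlockType") != "QUERY_RESULT":
                    continue
                confidence = result.get("Confidence", 0.0)
                if confidence < min_confidence:
                    continue
                # Keep only the highest-confidence answer per query.
                if best is None or confidence > best[1]:
                    best = (result.get("Text", ""), confidence)
        if best is not None:
            answers[key] = best
    return answers
```

Raising `min_confidence` is one possible way to handle the reliability concern: low-confidence answers are dropped and can be routed to manual review instead of being trusted.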

  7. Regarding the limit of custom queries in adapters: Currently, Amazon Textract's Custom Queries feature has a limitation of 30 queries per adapter. This limit applies regardless of whether you're using a free or paid plan. There isn't a direct way to exceed this 30-query limit within a single adapter.

However, you might be able to work around this limitation by creating multiple adapters, each with up to 30 queries, and applying them sequentially or to different parts of your documents. This approach would require careful management of your adapters and potentially more complex integration with your workflow.
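The multi-adapter workaround could be sketched as follows. This is an assumption-laden illustration, not a confirmed recipe: it splits the full query list into batches of at most 30 and pairs each batch with one adapter, producing the `FeatureTypes`, `QueriesConfig` and `AdaptersConfig` parts of a request. The helper names are hypothetical, and the per-call query limits of the API should be checked against the current documentation before relying on a batch size of 30.

```python
def chunk_queries(queries, batch_size=30):
    """Split a list of query dicts into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]


def build_request_parts(queries, adapters, batch_size=30):
    """Pair each query batch with one adapter.

    queries:  list of {"Text": ..., "Alias": ...} dicts
    adapters: list of {"AdapterId": ..., "Version": ...} dicts, one per batch
    Returns one dict per batch; the caller still has to add the document
    (or document location) before sending it to Textract.
    """
    batches = chunk_queries(queries, batch_size)
    if len(batches) > len(adapters):
        raise ValueError(
            f"{len(batches)} batches need {len(batches)} adapters, "
            f"got {len(adapters)}"
        )
    return [
        {
            "FeatureTypes": ["QUERIES"],
            "QueriesConfig": {"Queries": batch},
            "AdaptersConfig": {"Adapters": [adapter]},
        }
        for batch, adapter in zip(batches, adapters)
    ]
```

With 250 queries this yields 9 batches, so 9 trained adapters and 9 analysis passes per document, which is the management overhead mentioned above.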

  8. Your cost calculation is not quite right, because there's an important distinction to make. The pricing for Amazon Textract's AnalyzeDocument API (which includes custom queries) is indeed $15 per 1,000 pages. However, this is charged per page for each API call, not per query. So if you analyze a 15-page document once, regardless of how many queries you run, the cost would be:

15 (pages) * $0.015 = $0.225 per document

The number of queries doesn't directly affect the pricing in this way. You're charged for the document analysis, not per query executed.
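The arithmetic can be checked with a short sketch. It uses the $15 per 1,000 pages figure quoted in this thread (confirm against the current pricing page) and assumes every analysis pass over the document is billed once per page. The 30-queries-per-pass figure in the last line is an assumption for illustration: if the 250 queries have to be split across several passes, the per-page charge would presumably apply to each pass.

```python
import math

PRICE_PER_1000_PAGES = 15.0  # USD, figure quoted in this thread


def analysis_cost(pages, passes=1, price_per_1000_pages=PRICE_PER_1000_PAGES):
    """Cost when each pass is billed per page, independent of query count."""
    return pages * passes * price_per_1000_pages / 1000.0


def passes_needed(total_queries, queries_per_pass):
    """Number of passes required to cover all queries."""
    return math.ceil(total_queries / queries_per_pass)


# One pass over a 15-page document: 15 * $0.015
single_pass = analysis_cost(pages=15)

# The question's reading (billed per query): 250 * 15 * $0.015
per_query_reading = analysis_cost(pages=15, passes=250)

# Assumed scenario: 250 queries split into passes of 30 queries each
batched = analysis_cost(pages=15, passes=passes_needed(250, 30))

print(f"single pass: ${single_pass:.3f}")
print(f"per-query reading: ${per_query_reading:.2f}")
print(f"batched (assumed 30 queries per pass): ${batched:.3f}")
```

Under these assumptions the real figure sits between the two extremes: far below $56.25, but above $0.225 whenever more than one pass is needed.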

  9. To optimize query searches for documents with varying numbers of data points, consider implementing a two-step approach:

a) First, use a set of "indicator" queries to determine which data points are likely present in the document. These could be broader, more general queries that help you identify the document's structure and content.

b) Based on the results of these indicator queries, dynamically select which specific data extraction queries to run. This way, you're only running relevant queries for each document, potentially saving processing time and improving efficiency.
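Steps a) and b) might look like the sketch below. It assumes the indicator answers have already been reduced to a mapping of alias to `(answer_text, confidence)`, and that you maintain your own table linking each indicator to the group of detailed queries it unlocks. All names, aliases and the confidence threshold are hypothetical.

```python
def select_queries(indicator_answers, query_groups, min_confidence=50.0):
    """Pick the detailed queries worth running for this document.

    indicator_answers: {indicator_alias: (answer_text, confidence)}
    query_groups:      {indicator_alias: [detailed query dicts]}
    A group is selected only if its indicator returned a non-empty
    answer with at least min_confidence.
    """
    selected = []
    for indicator_alias, group in query_groups.items():
        hit = indicator_answers.get(indicator_alias)
        if hit is None:
            continue  # indicator found nothing, skip the whole group
        text, confidence = hit
        if text.strip() and confidence >= min_confidence:
            selected.extend(group)
    return selected


# Hypothetical example: two sections, only one detected with confidence.
groups = {
    "HAS_BILLING": [
        {"Text": "What is the billing address?", "Alias": "BILL_ADDR"},
        {"Text": "What is the billing contact?", "Alias": "BILL_CONTACT"},
    ],
    "HAS_SHIPPING": [
        {"Text": "What is the shipping address?", "Alias": "SHIP_ADDR"},
    ],
}
indicators = {"HAS_BILLING": ("Bill To", 92.0), "HAS_SHIPPING": ("Ship", 12.0)}
to_run = select_queries(indicators, groups)
```

Note that the indicator pass is itself an extra analysis of the document, so this trades one additional pass for running fewer detailed queries afterwards.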

Additionally, you might consider using the layout analysis feature of Amazon Textract to help identify where specific types of information are likely to be located in the document, which could further refine your query strategy.

Given the complexity of your use case and the current limitations of Custom Queries, you might also want to explore combining Textract with other AWS services like Amazon Comprehend for entity recognition or Amazon SageMaker for more customized machine learning models. This could potentially provide more flexibility in handling your diverse document types and large number of data points.

Lastly, keep in mind that Amazon Textract is continually evolving. While the current query limit is 30, it's worth keeping an eye on AWS announcements for potential future enhancements to the Custom Queries feature that might better accommodate your needs.

Sources
Customizing Outputs - Amazon Textract
Amazon Textract launches Custom Queries to improve information extraction for business-specific documents
Custom Queries tutorial - Amazon Textract
Customizing your Queries Responses - Amazon Textract

answered 2 months ago
