I'd suggest checking out the new Bedrock Data Automation (BDA), launched in preview just yesterday at re:Invent. With BDA you can define "blueprints" describing the structure to extract from documents, and the service uses generative AI models to pull out the relevant fields.
I'm not sure what BDA's limit on the number of fields per blueprint is (250 is quite a lot), but you should be able to create multiple blueprints and use them to extract different sets of fields in parallel, if needed.
Thank you for your detailed question about Amazon Textract's Document Analysis API and custom adapters. I'll address your questions and concerns point by point:
1-6. I understand your scenario: extracting 50-250 data points from PDF files of varying lengths and layouts, where the presence of individual data points is inconsistent across documents. You've had mixed results with the baseline Document Analysis API and found its JSON output cumbersome to parse.
- Regarding the limit of custom queries in adapters: Amazon Textract's Custom Queries feature is currently limited to 30 queries per adapter. The limit applies regardless of whether you're on a free or paid plan, and there is no direct way to exceed it within a single adapter.
However, you might be able to work around this limitation by creating multiple adapters, each with up to 30 queries, and applying them sequentially or to different parts of your documents. This approach would require careful management of your adapters and potentially more complex integration with your workflow.
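Here's a minimal sketch of that workaround using the boto3 Textract client; the adapter IDs, bucket, and query texts are placeholders. The synchronous AnalyzeDocument call shown handles single-page documents, so for multi-page PDFs you would use the asynchronous StartDocumentAnalysis/GetDocumentAnalysis pair, which accepts the same QueriesConfig and AdaptersConfig shapes:

```python
import boto3

textract = boto3.client("textract")

# Placeholder adapter IDs and queries -- substitute your own (max 30 queries per adapter).
ADAPTER_QUERY_SETS = [
    ("adapter-id-1", [{"Text": "What is the invoice number?", "Alias": "INVOICE_NO"}]),
    ("adapter-id-2", [{"Text": "What is the due date?", "Alias": "DUE_DATE"}]),
]

def analyze_with_adapters(bucket, key):
    """Run one AnalyzeDocument call per adapter, each with its own query set."""
    results = {}
    for adapter_id, queries in ADAPTER_QUERY_SETS:
        response = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={"Queries": queries},
            AdaptersConfig={
                "Adapters": [{"AdapterId": adapter_id, "Version": "1", "Pages": ["*"]}]
            },
        )
        # Collect QUERY_RESULT blocks, keyed by the alias of the query they answer.
        blocks = {b["Id"]: b for b in response["Blocks"]}
        for block in response["Blocks"]:
            if block["BlockType"] != "QUERY":
                continue
            alias = block["Query"].get("Alias", block["Query"]["Text"])
            for rel in block.get("Relationships", []):
                if rel["Type"] == "ANSWER":
                    results[alias] = blocks[rel["Ids"][0]].get("Text")
    return results
```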
- Your cost calculation is generally correct, but there's an important distinction to make. The pricing for Amazon Textract's AnalyzeDocument API with queries is indeed $15 per 1,000 pages, but it's charged per page analyzed, not per query. So if you analyze a 15-page document once, regardless of how many queries you run against it, the cost would be:
15 (pages) * $0.015 = $0.225 per document
The number of queries doesn't directly affect the price; you're charged for the pages analyzed, not for each query executed.
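As a rough sanity check, here's a tiny estimator; the $15 per 1,000 pages rate is the figure quoted above, so verify it against the current Textract pricing page for your region. Keep in mind that if you split queries across several adapters as described earlier, each analysis pass over the document is billed separately:

```python
# Rough cost sketch; confirm the per-page rate on the current pricing page.
PRICE_PER_PAGE = 15.00 / 1000  # $15 per 1,000 pages (queries feature)

def estimate_cost(pages: int, passes: int = 1) -> float:
    """Cost of analyzing `pages` pages, `passes` times (e.g. one pass per adapter)."""
    return pages * passes * PRICE_PER_PAGE

print(estimate_cost(15))            # 0.225 -> one pass over a 15-page document
print(estimate_cost(15, passes=3))  # 0.675 -> three adapters means three billed passes
```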
- To optimize query searches for documents with varying numbers of data points, consider a two-step approach (a short sketch follows this list):
a) First, use a set of "indicator" queries to determine which data points are likely present in the document. These could be broader, more general queries that help you identify the document's structure and content.
b) Based on the results of these indicator queries, dynamically select which specific data extraction queries to run. This way, you're only running relevant queries for each document, potentially saving processing time and improving efficiency.
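A minimal sketch of that two-step flow, again using the synchronous AnalyzeDocument call for brevity; the indicator queries, aliases, and the mapping from indicators to follow-up queries are illustrative placeholders:

```python
import boto3

textract = boto3.client("textract")

# Broad "indicator" queries (texts and aliases are placeholders).
INDICATOR_QUERIES = [
    {"Text": "Does this document contain an invoice table?", "Alias": "HAS_INVOICE"},
    {"Text": "Does this document contain a shipping address?", "Alias": "HAS_SHIPPING"},
]

# Follow-up queries to run only when the matching indicator returned an answer.
FOLLOW_UP_QUERIES = {
    "HAS_INVOICE": [{"Text": "What is the invoice total?", "Alias": "INVOICE_TOTAL"}],
    "HAS_SHIPPING": [{"Text": "What is the shipping address?", "Alias": "SHIP_ADDR"}],
}

def query_answers(document, queries):
    """Run a QUERIES analysis and return {alias: answer_text}."""
    response = textract.analyze_document(
        Document=document,
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": queries},
    )
    blocks = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"].get("Alias", block["Query"]["Text"])
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                answers[alias] = blocks[rel["Ids"][0]].get("Text")
    return answers

def extract(document):
    # Step 1: indicator pass to see which kinds of data the document contains.
    indicators = query_answers(document, INDICATOR_QUERIES)
    # Step 2: only run the follow-up queries whose indicator found something.
    selected = [q for alias in indicators for q in FOLLOW_UP_QUERIES.get(alias, [])]
    return query_answers(document, selected) if selected else {}
```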
Additionally, you might consider using the layout analysis feature of Amazon Textract to help identify where specific types of information are likely to be located in the document, which could further refine your query strategy.
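For instance, the LAYOUT feature type can tell you which section headers a page contains before you decide which queries to target at it. A rough sketch; how you map headers to query sets is up to your workflow:

```python
import boto3

textract = boto3.client("textract")

def section_headers(document):
    """Return the text of LAYOUT_SECTION_HEADER blocks, grouped by page."""
    response = textract.analyze_document(Document=document, FeatureTypes=["LAYOUT"])
    blocks = {b["Id"]: b for b in response["Blocks"]}
    headers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "LAYOUT_SECTION_HEADER":
            continue
        # A layout block's children are LINE blocks; join their text.
        text = " ".join(
            blocks[child_id]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
            if "Text" in blocks[child_id]
        )
        headers.setdefault(block.get("Page", 1), []).append(text)
    return headers
```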
Given the complexity of your use case and the current limitations of Custom Queries, you might also want to explore combining Textract with other AWS services like Amazon Comprehend for entity recognition or Amazon SageMaker for more customized machine learning models. This could potentially provide more flexibility in handling your diverse document types and large number of data points.
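As one illustration of that combination, you could feed the plain text that Textract detects into Amazon Comprehend's entity detection; whether Comprehend's built-in entity types cover your data points is something you'd have to evaluate, and long documents would need chunking to stay within Comprehend's input size limits:

```python
import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")

def detect_entities_in_document(document):
    """Extract plain text with Textract, then run Comprehend entity detection on it."""
    detected = textract.detect_document_text(Document=document)
    text = "\n".join(
        b["Text"] for b in detected["Blocks"] if b["BlockType"] == "LINE"
    )
    # Long documents should be chunked to respect Comprehend's input size limits.
    result = comprehend.detect_entities(Text=text, LanguageCode="en")
    return [(e["Type"], e["Text"], round(e["Score"], 3)) for e in result["Entities"]]
```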
Lastly, keep in mind that Amazon Textract is continually evolving. While the current query limit is 30, it's worth keeping an eye on AWS announcements for potential future enhancements to the Custom Queries feature that might better accommodate your needs.
Sources
Customizing Outputs - Amazon Textract
Amazon Textract launches Custom Queries to improve information extraction for business-specific documents
Custom Queries tutorial - Amazon Textract
Customizing your Queries Responses - Amazon Textract