1 Answer
When a single table cell holds multiple entities (such as a product code with its description, or a UOM with a quantity), Amazon Textract's standard table extraction returns the cell text as one string and does not separate the entities for you. Here are some approaches to handle this:
- Post-processing with pattern recognition: After extracting the table data with Textract's analyze_document API (TABLES feature type), apply custom post-processing logic to split each cell into its entities based on patterns:
  - For product code and description: look for a numeric or alphanumeric code at the start of the cell text, followed by descriptive text
  - For UOM and quantity: search for common UOM abbreviations (EA, KG, LBS, etc.) and separate them from the numeric value
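The two patterns above could be sketched roughly as follows. This is a minimal illustration, not a complete solution: the function names, the UOM list, and the rule that a product code is the first token and contains at least one digit are all assumptions that would need adjusting to your real data.

```python
import re

# Assumption: a product code is the first whitespace-delimited token and
# contains at least one digit; the rest of the cell is the description.
CODE_DESC = re.compile(r"^\s*(?P<code>(?=\S*\d)[A-Za-z0-9][\w\-/.]*)\s+(?P<desc>\S.*?)\s*$")

# Assumption: an illustrative UOM list; extend it with the units you expect.
UOMS = ("EA", "KG", "LBS", "LB", "PCS", "BOX", "M", "L")
_UOM_ALT = "|".join(sorted(UOMS, key=len, reverse=True))  # try longer units first
_NUM = r"\d[\d,]*(?:\.\d+)?"
QTY_UOM = re.compile(
    rf"^\s*(?:(?P<q1>{_NUM})\s*(?P<u1>{_UOM_ALT})|(?P<u2>{_UOM_ALT})\s*(?P<q2>{_NUM}))\s*$",
    re.IGNORECASE,
)


def split_code_description(cell_text):
    """Return (code, description); code is None when no leading code is found."""
    match = CODE_DESC.match(cell_text)
    if not match:
        return None, cell_text.strip()
    return match.group("code"), match.group("desc")


def split_uom_quantity(cell_text):
    """Return (uom, quantity) for cells like '10 EA' or 'KG 2.5', else (None, None)."""
    match = QTY_UOM.match(cell_text)
    if not match:
        return None, None
    uom = (match.group("u1") or match.group("u2")).upper()
    quantity = match.group("q1") or match.group("q2")
    return uom, float(quantity.replace(",", ""))


print(split_code_description("AB-1234 Stainless steel bolt M8"))
print(split_uom_quantity("1,200 PCS"))
```

Cells that do not match either pattern are returned unsplit (or as `(None, None)`), so they can be flagged for manual review rather than silently mis-parsed.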
- Use Amazon Bedrock for extraction: For more sophisticated extraction, you can use Amazon Bedrock Data Automation blueprints. You define a schema whose field descriptions tell the service how to identify and extract each entity:
    "properties": {
        "product_code": {
            "type": "string",
            "inferenceType": "Explicit",
            "description": "The product identifier code"
        },
        "product_description": {
            "type": "string",
            "inferenceType": "Explicit",
            "description": "The full item description text"
        },
        "uom": {
            "type": "string",
            "inferenceType": "Explicit",
            "description": "Unit of measure (EA, KG, etc.)"
        },
        "unit_price": {
            "type": "number",
            "inferenceType": "Explicit"
        }
    }
- Template-based approach: Since you mentioned that every customer has their own format, you could create a template JSON file for each customer format. After getting the Textract response, match it with the appropriate template to extract the correct entities. This involves:
  - Creating a template JSON for each customer format
  - Parsing the Textract response JSON
  - Using pattern matching or regular expressions to separate the entities based on the template
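As a rough sketch of those three steps: the template layout below (a `header_rows` count plus one regex per column) is a hypothetical format made up for illustration, and `table_rows` and `apply_template` are illustrative helper names. Only the block structure it reads (`Blocks`, `BlockType`, `RowIndex`, `ColumnIndex`, `CHILD` relationships, `Text`) comes from the Textract response itself.

```python
import json
import re
from collections import defaultdict

# Hypothetical template for one customer. "index" is the 0-based position of
# the column in the parsed row; each pattern's named groups become fields.
CUSTOMER_A_TEMPLATE = r"""
{
  "header_rows": 1,
  "columns": [
    {"index": 0, "pattern": "(?P<product_code>\\S+)\\s+(?P<product_description>.+)"},
    {"index": 1, "pattern": "(?P<quantity>\\d+(?:\\.\\d+)?)\\s*(?P<uom>[A-Za-z]+)"}
  ]
}
"""


def child_ids(block):
    """Ids of a block's CHILD relationships (empty cells have none)."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD" for i in rel["Ids"]]


def table_rows(response):
    """Flatten every TABLE in a Textract response into rows of cell strings."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    rows = []
    for table in (b for b in response["Blocks"] if b["BlockType"] == "TABLE"):
        grid = defaultdict(dict)
        for cell_id in child_ids(table):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [blocks[i]["Text"] for i in child_ids(cell)
                     if blocks[i]["BlockType"] == "WORD"]
            grid[cell["RowIndex"]][cell["ColumnIndex"]] = " ".join(words)
        rows.extend([row[c] for c in sorted(row)] for _, row in sorted(grid.items()))
    return rows


def apply_template(rows, template):
    """Split each data row into named fields using the template's patterns."""
    records = []
    for row in rows[template["header_rows"]:]:
        record = {}
        for column in template["columns"]:
            if column["index"] < len(row):
                match = re.fullmatch(column["pattern"], row[column["index"]].strip())
                if match:
                    record.update(match.groupdict())
        records.append(record)
    return records
```

With a response from analyze_document in hand, usage would look like `apply_template(table_rows(response), json.loads(CUSTOMER_A_TEMPLATE))`. Keeping the patterns in per-customer JSON means a new customer format needs a new template file rather than new code.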
- Custom queries with Textract: If your document formats are consistent, you could use Textract's Queries feature (the QUERIES feature type of analyze_document) to ask natural-language questions for the specific values you need, optionally restricted to particular pages of the document.
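A minimal sketch of the Queries approach is shown below. The question text, the `po_number` alias, and the helper names are placeholders chosen for illustration. The request and response shapes (`QueriesConfig`, `QUERY` blocks linked to `QUERY_RESULT` blocks through `ANSWER` relationships) follow the Textract API, but check them against the current documentation before relying on this.

```python
def build_request(document_bytes, questions):
    """Build analyze_document keyword arguments from an {alias: question} dict."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [{"Text": text, "Alias": alias}
                        for alias, text in questions.items()]
        },
    }


def query_answers(response):
    """Map each query alias (or its text, if no alias) to the answer text, or None."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        key = block["Query"].get("Alias") or block["Query"]["Text"]
        answers[key] = None  # stays None when Textract found no answer
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER" and rel["Ids"]:
                answers[key] = blocks[rel["Ids"][0]]["Text"]
    return answers


def run(path, questions):
    """Call Textract on a local file; needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without it

    with open(path, "rb") as handle:
        request = build_request(handle.read(), questions)
    return query_answers(boto3.client("textract").analyze_document(**request))
```

For example, `run("invoice.png", {"po_number": "What is the purchase order number?"})` would return a dict keyed by alias. Each query yields a single answer, so this tends to suit header-level fields better than repeated line items inside a table.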
For the most reliable results with varying formats, a combination of Textract's table extraction followed by custom post-processing logic tailored to each customer's format would likely yield the best results.
Sources
Creating Blueprints for Extraction - Amazon Bedrock
Tables - Amazon Textract
Analyzing Documents - Amazon Textract
Automatically extract content from PDF files using Amazon Textract - AWS Prescriptive Guidance
answered 12 days ago
