
How to extract product code, description, UOM, and price when multiple entities exist in a single cell using AWS Textract?


Hello,

I’ve attached an image and the corresponding table extracted by AWS Textract. As you can see, the table rows are somewhat collapsed: multiple pieces of information are grouped into a single cell.

My goal is to extract the following entities:

  • Product code
  • Product description
  • Unit of measure (UOM)
  • Unit price

However, I’m facing this issue:

  • The product code and description are merged together in the same column.
  • Similarly, the UOM and **quantity** appear together under one field.

Note: Every customer has its own format. Is there any reliable way in AWS Textract to separate or extract these entities individually?

Currently I'm using the Textract **analyze_document API** with the feature type 'TABLES'.

Any guidance or best practices would be greatly appreciated.

[Attached images: actual image, Textract table]

Thank you

asked 12 days ago, 33 views
1 Answer

When dealing with tables where multiple entities exist in a single cell (like product code with description or UOM with quantity), AWS Textract's standard table extraction may not automatically separate these entities. Here are approaches to handle this challenge:

  1. Post-processing with pattern recognition: After extracting the table data with Textract's analyze_document API using the TABLES feature type, you can implement custom post-processing logic to separate the entities based on patterns:
  • For product code and description: Look for patterns where codes might be numeric/alphanumeric at the beginning of the text followed by descriptive text
  • For UOM and quantity: Search for common UOM abbreviations (EA, KG, LBS, etc.) and separate them from numeric values
  2. Use Amazon Bedrock for extraction: For more sophisticated extraction, you can use Amazon Bedrock's data extraction capabilities (blueprints in Amazon Bedrock Data Automation). You can define a schema with appropriate field descriptions that instructs the system how to identify and extract specific entities:

     "properties": {
       "product_code": {
         "type": "string",
         "inferenceType": "Explicit",
         "description": "The product identifier code"
       },
       "product_description": {
         "type": "string",
         "inferenceType": "Explicit",
         "description": "The full item description text"
       },
       "uom": {
         "type": "string",
         "inferenceType": "Explicit",
         "description": "Unit of measure (EA, KG, etc.)"
       },
       "unit_price": {
         "type": "number",
         "inferenceType": "Explicit"
       }
     }

  3. Template-based approach: Since you mentioned that every customer has their own format, you could create template JSON files for different customer formats. After getting the Textract response, match it with the appropriate template to extract the correct entities. This involves:
  • Creating a template JSON for each customer format
  • Parsing the Textract response JSON
  • Using pattern matching or regular expressions to separate the entities based on the template
  4. Custom queries with Textract: If you're working with consistent document formats, you could explore using Textract's query capabilities to specifically ask for the information you need from particular regions of the document.
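
As a rough illustration of the query approach, the sketch below builds an analyze_document request with both TABLES and QUERIES enabled and reads the answers back from the QUERY and QUERY_RESULT blocks. The helper names (build_request, parse_query_answers, run_queries), the aliases, and the question wording are assumptions made for this example, not something Textract prescribes. Queries tend to suit single values better than repeating line items, so treat this as a starting point to test against your own documents.

```python
def build_request(image_bytes, queries):
    """Build analyze_document keyword arguments with TABLES and QUERIES enabled.

    `queries` maps a hypothetical alias to a natural-language question.
    """
    return {
        "Document": {"Bytes": image_bytes},
        "FeatureTypes": ["TABLES", "QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in queries.items()
            ]
        },
    }


def parse_query_answers(response):
    """Map each query alias to the text of its first answer (None if unanswered)."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        # QUERY blocks point at their QUERY_RESULT blocks via ANSWER relationships.
        answer_ids = [
            rid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for rid in rel["Ids"]
        ]
        texts = [by_id[rid].get("Text") for rid in answer_ids if rid in by_id]
        answers[alias] = texts[0] if texts else None
    return answers


def run_queries(image_bytes, queries):
    """Call Textract and return alias -> answer text (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("textract")
    response = client.analyze_document(**build_request(image_bytes, queries))
    return parse_query_answers(response)
```

For example, run_queries(image_bytes, {"UNIT_PRICE": "What is the unit price of the first item?"}) would return a dictionary keyed by the alias; the question text here is only a placeholder.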

For the most reliable results with varying formats, a combination of Textract's table extraction followed by custom post-processing logic tailored to each customer's format would likely yield the best results.
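
A minimal sketch of that combination is shown below. It rebuilds table rows from the CELL blocks in a Textract response, then splits the merged cells with per-customer regular expressions. The template name customer_a, the column positions, and the patterns are all hypothetical: they assume the code is the first token of the first column, the second column reads like "10 EA" (quantity first), and the third column holds the price. It also assumes a single table per response, so adjust it to each real layout.

```python
import re


def cells_to_rows(blocks):
    """Rebuild rows of cell text from Textract CELL blocks (assumes one table)."""
    by_id = {block["Id"]: block for block in blocks}
    grid = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        child_ids = [
            cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]
        words = [
            by_id[cid]["Text"]
            for cid in child_ids
            if cid in by_id and by_id[cid]["BlockType"] == "WORD"
        ]
        grid.setdefault(block["RowIndex"], {})[block["ColumnIndex"]] = " ".join(words)
    return [[cols[c] for c in sorted(cols)] for _, cols in sorted(grid.items())]


# Hypothetical template for one customer layout; add one entry per format.
TEMPLATES = {
    "customer_a": {
        "columns": {"code_desc": 0, "qty_uom": 1, "unit_price": 2},
        "code_desc": re.compile(r"^(?P<code>[A-Z0-9-]+)\s+(?P<desc>.+)$"),
        "qty_uom": re.compile(r"^(?P<qty>\d+(?:\.\d+)?)\s*(?P<uom>[A-Za-z]+)$"),
    },
}


def split_row(row, template):
    """Return the target entities, or None when the row does not fit the template."""
    cols = template["columns"]
    if len(row) <= max(cols.values()):
        return None
    code_desc = template["code_desc"].match(row[cols["code_desc"]].strip())
    qty_uom = template["qty_uom"].match(row[cols["qty_uom"]].strip())
    if not (code_desc and qty_uom):
        return None  # header rows and unexpected layouts end up here
    try:
        # Drop currency symbols and thousands separators before parsing.
        unit_price = float(re.sub(r"[^\d.]", "", row[cols["unit_price"]]))
    except ValueError:
        return None
    return {
        "product_code": code_desc.group("code"),
        "product_description": code_desc.group("desc"),
        "quantity": float(qty_uom.group("qty")),
        "uom": qty_uom.group("uom").upper(),
        "unit_price": unit_price,
    }
```

Rows that return None are worth logging rather than discarding, since they usually indicate a header row or a customer format that still needs its own template.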
Sources
Creating Blueprints for Extraction - Amazon Bedrock
Tables - Amazon Textract
Analyzing Documents - Amazon Textract
Automatically extract content from PDF files using Amazon Textract - AWS Prescriptive Guidance

answered 12 days ago
