Greeting
Hi Sharon!
Thanks for reaching out with your question about using AWS Textract for document analysis. It’s clear you’re working hard to make Textract’s Query features retrieve consistent results, so let’s dive in and get you a reliable solution.
Clarifying the Issue
You mentioned that some PDF files are not retrieving QUERY or QUERY_RESULT block types when processed through the @aws-sdk/client-textract Node.js package, despite these same files working perfectly in the AWS Textract Console. Specific fields, such as GSTIN and Invoice Number, are inconsistently detected, creating challenges with uniform data extraction.
Looking at your invoice, the structure appears consistent, but differences in how the SDK and Console handle queries (or possibly the document layout itself) might explain these inconsistencies. Let’s explore how to ensure Textract retrieves your desired fields accurately across all your files.
Why This Matters
Textract’s ability to extract specific fields using queries is crucial for automating workflows like invoice processing. Consistency in detecting structured data like GSTIN and Invoice Numbers ensures that your system remains reliable, efficient, and scalable without manual intervention. Addressing this issue will allow you to process various invoices seamlessly, regardless of minor format variations.
Key Terms
- BlockType: Represents a type of data detected in the document, such as QUERY, QUERY_RESULT, KEY_VALUE_SET, or others.
- Textract Console: The AWS web interface for Textract, often configured with defaults for processing documents.
- SDK: Software Development Kit; in this case, the @aws-sdk/client-textract package for Node.js.
- Query: A feature in Textract to retrieve specific data points from documents, such as GSTIN or Invoice Numbers.
- OCR: Optical Character Recognition; technology used to convert scanned or printed text into machine-readable text.
The Solution (Our Recipe)
Steps at a Glance:
- Confirm Query Configuration in the Node.js Code.
- Enable Advanced OCR Settings in SDK Calls.
- Preprocess Documents for Consistency.
- Debug Outputs for Missing or Misclassified Blocks.
- Add Fallback Logic for Manual Key-Value Detection.
Step-by-Step Guide:
- Confirm Query Configuration in the Node.js Code
Ensure you’ve defined queries correctly in your SDK call. Each query needs a Text value, and FeatureTypes must include "QUERIES"; otherwise Textract returns no QUERY or QUERY_RESULT blocks. Example:
    import { readFileSync } from "node:fs";
    import { TextractClient, AnalyzeDocumentCommand } from "@aws-sdk/client-textract";

    const client = new TextractClient({ region: "us-east-1" });
    const analyzeParams = {
      // File data as a buffer; "invoice.pdf" is a placeholder path
      Document: { Bytes: readFileSync("invoice.pdf") },
      FeatureTypes: ["QUERIES"],
      QueriesConfig: {
        Queries: [{ Text: "GSTIN" }, { Text: "Invoice Number" }],
      },
    };
    const command = new AnalyzeDocumentCommand(analyzeParams);
    const response = await client.send(command);
    console.log(response.Blocks);
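One thing worth trying here: Textract queries generally respond better to natural-language questions than to bare labels, and each query can also carry an Alias and a Pages list. As far as I know, a query without Pages is only run against page 1, which matters for multi-page files. The sketch below builds a QueriesConfig that way. The helper name `buildQueriesConfig` and the sample questions are my own assumptions, not something from your code.

```javascript
// Hypothetical helper: builds a QueriesConfig from a map of alias -> question.
// Text, Alias, and Pages are fields of the Textract Query type; the helper
// name and the sample questions are assumptions for illustration.
function buildQueriesConfig(questions, pages = ["*"]) {
  return {
    Queries: Object.entries(questions).map(([alias, text]) => ({
      Text: text,
      Alias: alias,
      Pages: pages, // "*" asks Textract to search every page
    })),
  };
}

const queriesConfig = buildQueriesConfig({
  GSTIN: "What is the GSTIN?",
  INVOICE_NUMBER: "What is the invoice number?",
});
console.log(JSON.stringify(queriesConfig, null, 2));
```

You would pass the returned object as QueriesConfig in place of the hand-written one. The aliases make it easier to match answers back to fields later.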
- Enable Advanced OCR Settings in SDK Calls
The Textract Console may apply default configurations that are not explicit in your SDK calls. The suggestion here was to specify the language and reading mode explicitly:
    const analyzeParams = {
      Document: { Bytes: fileBuffer }, // File data as buffer
      FeatureTypes: ["QUERIES"],
      QueriesConfig: {
        Queries: [{ Text: "GSTIN" }, { Text: "Invoice Number" }],
      },
      // Specify document language
      DocumentReadMode: "FORMS_AND_TABLES", // Enhance OCR
      LanguageCode: "en", // Adjust if the document uses a different language
    };
Note that AnalyzeDocumentCommand does not accept DocumentReadMode or LanguageCode; the SDK rejects both with UnexpectedParameter validation errors, as covered in the follow-up below. Leave these two keys out of your request.
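- Preprocess Documents for Consistency
Some documents may fail due to layout or font variations. Preprocess the PDFs to standardize formats:
- OCR Standardization: Use tools like Tesseract to produce searchable PDFs.
- Flatten PDFs: Flatten layers to simplify structure and avoid embedded fonts or hidden elements.
Example using Tesseract. Tesseract reads images rather than PDFs, so rasterize the pages first (pdftoppm from poppler-utils is one option; the output file name below assumes a document with fewer than ten pages):
    pdftoppm -r 300 -png input.pdf page
    tesseract page-1.png output -l eng pdf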
- Debug Outputs for Missing or Misclassified Blocks
Enable detailed logging to analyze missing blocks or misclassified data. If the expected fields do not appear as QUERY_RESULT, they may have been misclassified or missed due to layout issues. Example Debug Logging:
    (response.Blocks ?? []).forEach((block) => {
      console.log(block.BlockType, block.Text, block.Query ? block.Query.Text : "N/A");
    });
This can help identify if the query text is not matching or if the data is misclassified.
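To see at a glance which queries came back empty, it helps to pair each QUERY block with its answers. In the Textract response, a QUERY block carries the Query object and points to its QUERY_RESULT blocks through a relationship of type ANSWER. Here is a minimal sketch; the function name and the sample blocks are hand-written assumptions, not real Textract output.

```javascript
// Pairs each QUERY block with the text of its QUERY_RESULT blocks.
// Assumes `blocks` has the shape of the Blocks array returned by Textract.
function pairQueriesWithAnswers(blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  return blocks
    .filter((b) => b.BlockType === "QUERY")
    .map((query) => {
      // A QUERY block links to its answers via an ANSWER relationship.
      const answerIds = (query.Relationships ?? [])
        .filter((rel) => rel.Type === "ANSWER")
        .flatMap((rel) => rel.Ids);
      const answers = answerIds
        .map((id) => byId.get(id))
        .filter((b) => b && b.BlockType === "QUERY_RESULT")
        .map((b) => ({ text: b.Text, confidence: b.Confidence }));
      return { query: query.Query?.Text, alias: query.Query?.Alias, answers };
    });
}

// Hand-written sample data for illustration only.
const sampleBlocks = [
  {
    Id: "q1",
    BlockType: "QUERY",
    Query: { Text: "GSTIN" },
    Relationships: [{ Type: "ANSWER", Ids: ["a1"] }],
  },
  { Id: "a1", BlockType: "QUERY_RESULT", Text: "22AAAAA0000A1Z5", Confidence: 98 },
  { Id: "q2", BlockType: "QUERY", Query: { Text: "Invoice Number" } },
];

for (const { query, answers } of pairQueriesWithAnswers(sampleBlocks)) {
  const shown = answers.length ? answers.map((a) => a.text).join(", ") : "(no answer)";
  console.log(query, "->", shown);
}
```

A query that appears with no answers was accepted by Textract but found nothing, which points at the query wording or the layout rather than at your request parameters.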
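- Add Fallback Logic for Manual Key-Value Detection
If Textract’s Query features still fail for certain documents, implement fallback logic to detect the key-value pairs manually. KEY_VALUE_SET blocks are only returned when FeatureTypes also includes "FORMS", and they carry no Text of their own: the label sits in the key's CHILD word blocks, and the value is linked through a VALUE relationship. Example:
    const byId = new Map(response.Blocks.map((b) => [b.Id, b]));
    const ids = (b, type) => (b.Relationships ?? []).filter((r) => r.Type === type).flatMap((r) => r.Ids);
    // Join the text of a block's CHILD words
    const text = (b) => ids(b, "CHILD").map((id) => byId.get(id)?.Text ?? "").join(" ");
    response.Blocks
      .filter((b) => b.BlockType === "KEY_VALUE_SET" && b.EntityTypes?.includes("KEY") && text(b).includes("GSTIN"))
      .forEach((key) => console.log("GSTIN Found:", ids(key, "VALUE").map((id) => text(byId.get(id))).join(" ")));
This gives you a second chance to retrieve fields like GSTIN or Invoice Numbers when queries fail.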
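As a last resort for GSTIN specifically, you could scan the plain LINE text with a pattern match. This is a different technique from the key-value lookup above, and it rests on my assumption about the GSTIN layout: a 2-digit state code, a 10-character PAN, an entity code, the letter Z, and a check character. It does not verify the check character. The function name and sample lines are made up for illustration.

```javascript
// Assumed GSTIN layout: 2-digit state code, 10-character PAN,
// entity code, "Z", check character (the checksum is NOT verified here).
const GSTIN_PATTERN = /\b\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b/g;

// Returns the distinct GSTIN-like strings found in LINE blocks.
function findGstins(blocks) {
  const found = new Set();
  for (const block of blocks) {
    if (block.BlockType !== "LINE" || !block.Text) continue;
    for (const match of block.Text.toUpperCase().matchAll(GSTIN_PATTERN)) {
      found.add(match[0]);
    }
  }
  return [...found];
}

// Made-up sample lines, not real Textract output.
const sampleLines = [
  { BlockType: "LINE", Text: "GSTIN: 27AAPFU0939F1ZV" },
  { BlockType: "LINE", Text: "Invoice No. INV-001" },
  { BlockType: "WORD", Text: "27AAPFU0939F1ZV" },
];
console.log(findGstins(sampleLines));
```

Because the pattern only checks the shape of the string, treat a match as a candidate and keep the query or key-value result when one is available.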
Closing Thoughts
By combining precise query configurations, preprocessing steps, and fallback logic, you can resolve the inconsistencies in Textract’s QUERY and QUERY_RESULT block detection. Debugging and preprocessing will play a crucial role in ensuring consistent data extraction across all your invoice formats.
For additional information, check out these resources:
- Textract Developer Guide
- AnalyzeDocument API Reference
- Textract Query Language
- Using Pretrained OCR with Tesseract
Let me know if you’d like further assistance or if we should refine the solution further. I’m here to help! 😊
Farewell
I hope this enhanced response helps you resolve the issue, Sharon! If you encounter further challenges or need more code examples, feel free to reach out. Good luck automating your invoice processing! 🚀✨
Cheers,
Aaron 😊
Follow-Up Guidance for Textract Query Issue
Hi Sharon!
Thanks for sharing your code sample and the validation errors you're encountering. Let’s address these issues to help you move forward smoothly.
Clarifying the Errors
The UnexpectedParameter errors indicate that DocumentReadMode and LanguageCode are not valid parameters for the AnalyzeDocumentCommand in the @aws-sdk/client-textract library. These keys were incorrectly included in the input object, likely based on Console-specific defaults that aren't explicitly supported in the SDK.
Additionally, you're processing a multi-page PDF, so it’s important to confirm that the queries are correctly configured for both pages. AWS Textract processes each page as a separate entity, and results may vary if a query’s target content is split across pages or located in a non-standard layout.
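Since each block in a multi-page result carries a Page number, grouping the QUERY_RESULT blocks by page shows quickly whether page 2 produced any answers at all. A minimal sketch, assuming the usual shape of the Blocks array; the function name and sample data are hypothetical.

```javascript
// Groups QUERY_RESULT text by the page it was found on.
function groupQueryResultsByPage(blocks) {
  const pages = new Map();
  for (const block of blocks) {
    if (block.BlockType !== "QUERY_RESULT") continue;
    const page = block.Page ?? 1; // assumption: treat a missing Page as page 1
    if (!pages.has(page)) pages.set(page, []);
    pages.get(page).push(block.Text);
  }
  return pages;
}

// Hand-written sample data for illustration only.
const samplePageBlocks = [
  { BlockType: "QUERY_RESULT", Page: 1, Text: "22AAAAA0000A1Z5" },
  { BlockType: "QUERY_RESULT", Page: 1, Text: "INV-001" },
  { BlockType: "QUERY_RESULT", Page: 2, Text: "INV-002" },
  { BlockType: "LINE", Page: 2, Text: "Tax Invoice" },
];

for (const [page, texts] of groupQueryResultsByPage(samplePageBlocks)) {
  console.log(`Page ${page}:`, texts.join(", "));
}
```

If a page is missing from the output entirely, the queries were probably not run against it, so check the Pages setting on each query.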
Adjusted Code Sample
Here’s a revised version of your code that removes unsupported parameters and improves handling for multi-page documents:
    import { TextractClient, StartDocumentAnalysisCommand } from "@aws-sdk/client-textract";

    const textractClient = new TextractClient({ region: process.env.AWS_REGION });

    const input = {
      FeatureTypes: ["QUERIES"],
      QueriesConfig: {
        Queries: [{ Text: "GSTIN" }, { Text: "Invoice Number" }],
      },
      DocumentLocation: {
        S3Object: {
          Bucket: process.env.AWS_S3_BUCKET || "",
          Name: fileKey, // S3 key of the uploaded PDF, defined elsewhere in your code
        },
      },
      ClientRequestToken: `${Date.now()}`,
      NotificationChannel: {
        SNSTopicArn: process.env.AWS_SNS_TOPIC_ARN || "",
        RoleArn: process.env.AWS_ROLE_ARN || "",
      },
      JobTag: "INVOICE_PROCESS",
    };

    async function startDocumentAnalysis() {
      try {
        const command = new StartDocumentAnalysisCommand(input);
        const response = await textractClient.send(command);
        console.log("Job started successfully:", response.JobId);
      } catch (error) {
        console.error("Error processing invoice:", error);
      }
    }

    startDocumentAnalysis();
Key Changes:
- Removed Invalid Parameters: Removed DocumentReadMode and LanguageCode.
- Correct API Command: Used StartDocumentAnalysisCommand for asynchronous job processing, as it's necessary for multi-page PDF files.
- ClientRequestToken: Added a unique idempotency token, so that a retried request with the same token does not start a duplicate job.
- NotificationChannel: Configured the SNS topic and the IAM role Textract uses to publish to it, so you receive job completion notifications.
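Since your code already creates an SQS client, you are probably reading the completion notification from a queue subscribed to the SNS topic. Here is a rough sketch of unpacking that message. The field names (JobId, Status, JobTag) reflect my understanding of the Textract completion message and the SNS envelope, so please verify them against a real message from your queue. The function name is hypothetical.

```javascript
// Extracts the job details from an SQS message body delivered via SNS.
// Assumption: the SNS envelope holds the Textract message as a JSON string
// in its "Message" field; with raw message delivery the body is the message.
function parseTextractNotification(sqsBody) {
  const envelope = JSON.parse(sqsBody);
  const message =
    typeof envelope.Message === "string" ? JSON.parse(envelope.Message) : envelope;
  return { jobId: message.JobId, status: message.Status, jobTag: message.JobTag };
}

// Made-up sample body for illustration only.
const sampleBody = JSON.stringify({
  Type: "Notification",
  Message: JSON.stringify({
    JobId: "example-job-id",
    Status: "SUCCEEDED",
    API: "StartDocumentAnalysis",
    JobTag: "INVOICE_PROCESS",
  }),
});
console.log(parseTextractNotification(sampleBody));
```

Only fetch results once the status is SUCCEEDED; for any other status, log the message and skip the result call.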
Debugging Query Results
Once the job completes, retrieve the results using the GetDocumentAnalysisCommand. Ensure you check all pages for QUERY and QUERY_RESULT block types:
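    import { GetDocumentAnalysisCommand } from "@aws-sdk/client-textract";

    async function getJobResults(jobId) {
      try {
        // Collect every page of results; NextToken is sent together with the same JobId
        const blocks = [];
        let nextToken;
        do {
          const command = new GetDocumentAnalysisCommand({ JobId: jobId, NextToken: nextToken });
          const response = await textractClient.send(command);
          blocks.push(...(response.Blocks ?? []));
          nextToken = response.NextToken;
        } while (nextToken);

        // The query text lives on QUERY blocks; answers are linked by ANSWER relationships
        const byId = new Map(blocks.map((b) => [b.Id, b]));
        for (const block of blocks) {
          if (block.BlockType !== "QUERY") continue;
          const answerIds = (block.Relationships ?? [])
            .filter((rel) => rel.Type === "ANSWER")
            .flatMap((rel) => rel.Ids);
          console.log(`Query: ${block.Query?.Text}`);
          console.log(`Result: ${answerIds.map((id) => byId.get(id)?.Text).join(", ") || "(no answer)"}`);
        }
      } catch (error) {
        console.error("Error retrieving job results:", error);
      }
    }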
Steps:
- Use the returned JobId to fetch results.
- Iterate through all pages and capture QUERY and QUERY_RESULT blocks.
- Implement pagination handling by passing NextToken together with the same JobId on each follow-up request.
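If you would rather poll than wait for the SNS notification, the job status can be checked in a loop until it leaves IN_PROGRESS. The sketch below keeps the Textract call out of the loop by taking a `getStatus` function, which is my own hypothetical design so the loop can be tried without AWS access. The interval and attempt limit are arbitrary defaults.

```javascript
// Polls until the job leaves IN_PROGRESS, then returns the final status.
// `getStatus` is an injected async function that resolves to a JobStatus
// string such as "IN_PROGRESS", "SUCCEEDED", "FAILED", or "PARTIAL_SUCCESS".
async function waitForJob(getStatus, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await getStatus();
    if (status !== "IN_PROGRESS") return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job still in progress after ${maxAttempts} attempts`);
}

// Stand-in for the real call, which would look roughly like:
//   async () => (await textractClient.send(
//     new GetDocumentAnalysisCommand({ JobId: jobId, MaxResults: 1 }))).JobStatus
const fakeStatuses = ["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"];
waitForJob(async () => fakeStatuses.shift(), { intervalMs: 10 }).then((status) =>
  console.log("Final status:", status)
);
```

Call getJobResults only when the final status is SUCCEEDED; a FAILED status usually comes with a StatusMessage on the response that is worth logging.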
Improving Results for Specific Fields
If GSTIN or Invoice Number queries still fail for some files:
- Refine Query Text: Use exact field labels or synonyms found in your documents (e.g., “GST Number” instead of “GSTIN”).
- Document Preprocessing: Convert PDFs to a simpler format using tools like Tesseract to standardize fonts, layers, and layouts. Tesseract reads images rather than PDFs, so rasterize the pages first (pdftoppm is one option; the file name assumes fewer than ten pages):
    pdftoppm -r 300 -png input.pdf page
    tesseract page-1.png output -l eng pdf
- Fallback Logic: Use the KEY_VALUE_SET blocks (returned when FeatureTypes includes "FORMS") to manually extract GSTIN or Invoice Numbers when queries don’t produce results.
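The synonym idea and the key-value fallback combine naturally: build a label-to-value map from the KEY_VALUE_SET blocks, then look a field up under several label variants. This is a sketch under the assumption that the response follows the usual Textract shape, where a KEY block links to its value through a VALUE relationship and both get their text from CHILD word blocks. All function names and the sample data are hypothetical.

```javascript
// Joins the Text of a block's CHILD blocks (KEY_VALUE_SET blocks have no Text).
function childText(block, byId) {
  return (block?.Relationships ?? [])
    .filter((rel) => rel.Type === "CHILD")
    .flatMap((rel) => rel.Ids)
    .map((id) => byId.get(id)?.Text ?? "")
    .join(" ")
    .trim();
}

// Builds a map of lowercase key label -> value text from KEY_VALUE_SET blocks.
function keyValueMap(blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  const result = new Map();
  for (const block of blocks) {
    if (block.BlockType !== "KEY_VALUE_SET" || !block.EntityTypes?.includes("KEY")) continue;
    const value = (block.Relationships ?? [])
      .filter((rel) => rel.Type === "VALUE")
      .flatMap((rel) => rel.Ids)
      .map((id) => childText(byId.get(id), byId))
      .join(" ");
    result.set(childText(block, byId).toLowerCase(), value);
  }
  return result;
}

// Returns the value of the first key that contains one of the label variants.
function lookupField(blocks, labelVariants) {
  const pairs = keyValueMap(blocks);
  for (const variant of labelVariants) {
    for (const [key, value] of pairs) {
      if (key.includes(variant.toLowerCase())) return value;
    }
  }
  return undefined;
}

// Hand-written sample data for illustration only.
const sampleFormBlocks = [
  {
    Id: "k1",
    BlockType: "KEY_VALUE_SET",
    EntityTypes: ["KEY"],
    Relationships: [
      { Type: "VALUE", Ids: ["v1"] },
      { Type: "CHILD", Ids: ["w1", "w2"] },
    ],
  },
  {
    Id: "v1",
    BlockType: "KEY_VALUE_SET",
    EntityTypes: ["VALUE"],
    Relationships: [{ Type: "CHILD", Ids: ["w3"] }],
  },
  { Id: "w1", BlockType: "WORD", Text: "GST" },
  { Id: "w2", BlockType: "WORD", Text: "Number:" },
  { Id: "w3", BlockType: "WORD", Text: "27AAPFU0939F1ZV" },
];
console.log(lookupField(sampleFormBlocks, ["GSTIN", "GST Number", "GST No"]));
```

Ordering the variants from most to least specific keeps a loose label such as "GST" from matching an unrelated key first.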
Closing Thoughts
These adjustments should resolve the validation errors and improve consistency in detecting QUERY_RESULT blocks. If you still encounter issues, consider preprocessing problematic PDFs or testing the Textract service with fewer features (e.g., disable tables or forms) to isolate potential layout-related problems.
Let me know how it goes or if you’d like further clarification on any part of this solution! 😊🚀
Cheers,
Aaron 😊
Hi Aaron Rose, thank you for your valuable answer. I am getting validation errors as follows: Error processing invoice: MultipleValidationErrors: There were 2 validation errors:
I am sharing a basic sample of my code below. I am uploading a bill PDF with 2 pages (a bill of the same type, as a 2-page PDF).
import { TextractClient, } from "@aws-sdk/client-textract"; import AWS from "aws-sdk";
const textract = new AWS.Textract(); const sqs = new AWS.SQS(); const textractClient = new TextractClient({ region: process.env.AWS_REGION });
}