Textract Document Analysis not retrieving query output.

Textract is not returning blocks of BlockType 'QUERY' or 'QUERY_RESULT' for some PDF files. When I upload a file in the AWS Textract console, I get answers to the query questions, but the same file returns no query output through the Node.js package @aws-sdk/client-textract. It works for some files and not for others. Please suggest a solution. I have attached the bill file for reference. Mainly, the GSTIN and invoice number are not detected in a couple of files with an invoice format similar to this one. Bill File

asked a month ago, 70 views
2 Answers

Greeting

Hi Sharon!

Thanks for reaching out with your question about using AWS Textract for document analysis. It’s clear you’re working hard to make Textract’s Query features retrieve consistent results, so let’s dive in and get you a reliable solution.


Clarifying the Issue

You mentioned that some PDF files are not retrieving QUERY or QUERY_RESULT block types when processed through the @aws-sdk/client-textract Node.js package, despite these same files working perfectly in the AWS Textract Console. Specific fields, such as GSTIN and Invoice Number, are inconsistently detected, creating challenges with uniform data extraction.

Looking at your invoice, the structure appears consistent, but differences in how the SDK and Console handle queries (or possibly the document layout itself) might explain these inconsistencies. Let’s explore how to ensure Textract retrieves your desired fields accurately across all your files.


Why This Matters

Textract’s ability to extract specific fields using queries is crucial for automating workflows like invoice processing. Consistency in detecting structured data like GSTIN and Invoice Numbers ensures that your system remains reliable, efficient, and scalable without manual intervention. Addressing this issue will allow you to process various invoices seamlessly, regardless of minor format variations.


Key Terms

  • BlockType: Represents a type of data detected in the document, such as QUERY, QUERY_RESULT, KEY_VALUE_SET, or others.
  • Textract Console: The AWS web interface for Textract, often configured with defaults for processing documents.
  • SDK: Software Development Kit; in this case, the @aws-sdk/client-textract package for Node.js.
  • Query: A feature in Textract to retrieve specific data points from documents, such as GSTIN or Invoice Numbers.
  • OCR: Optical Character Recognition; technology used to convert scanned or printed text into machine-readable text.

The Solution (Our Recipe)

Steps at a Glance:

  1. Confirm Query Configuration in the Node.js Code.
  2. Enable Advanced OCR Settings in SDK Calls.
  3. Preprocess Documents for Consistency.
  4. Debug Outputs for Missing or Misclassified Blocks.
  5. Add Fallback Logic for Manual Key-Value Detection.

Step-by-Step Guide:

  1. Confirm Query Configuration in the Node.js Code
    Ensure the queries are defined correctly in your SDK call: FeatureTypes must include "QUERIES", and every entry in QueriesConfig.Queries needs a Text value, ideally worded the way the field is labeled in the document.

    Example:

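    import { readFile } from "node:fs/promises";
    import { TextractClient, AnalyzeDocumentCommand } from "@aws-sdk/client-textract";
    
    const client = new TextractClient({ region: "us-east-1" });
    
    const analyzeParams = {
        Document: {
            // Raw file bytes; the synchronous API accepts images and single-page PDFs only
            Bytes: await readFile("invoice.pdf") // placeholder path
        },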
        FeatureTypes: ["QUERIES"],
        QueriesConfig: {
            Queries: [
                { Text: "GSTIN" },
                { Text: "Invoice Number" }
            ]
        }
    };
    
    const command = new AnalyzeDocumentCommand(analyzeParams);
    const response = await client.send(command);
    
    console.log(response.Blocks);
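
    Queries usually behave better when they are phrased as the question you would ask about the document, and an Alias makes the matching block easier to pick out later. A minimal sketch, assuming a hypothetical helper named buildQueriesConfig and example question wording that is not taken from the original invoice:

```javascript
// Hypothetical helper: turns field labels into a QueriesConfig object.
// The question wording and the alias names are illustrative assumptions.
function buildQueriesConfig(fields) {
    return {
        Queries: fields.map(({ label, alias }) => ({
            Text: `What is the ${label}?`,
            Alias: alias
        }))
    };
}

const exampleQueriesConfig = buildQueriesConfig([
    { label: "GSTIN", alias: "GSTIN" },
    { label: "invoice number", alias: "INVOICE_NUMBER" }
]);

console.log(JSON.stringify(exampleQueriesConfig, null, 2));
```

    The resulting object can be passed as QueriesConfig in analyzeParams in place of the hard-coded list.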

  2. Enable Advanced OCR Settings in SDK Calls
    The Textract Console may apply default configurations that are not explicit in your SDK calls. Specify the language and enhance text recognition:

    const analyzeParams = {
        Document: {
            Bytes: /* File data as buffer */
        },
        FeatureTypes: ["QUERIES"],
        QueriesConfig: {
            Queries: [
                { Text: "GSTIN" },
                { Text: "Invoice Number" }
            ]
        },
        // Specify document language
        DocumentReadMode: "FORMS_AND_TABLES", // Enhance OCR
        LanguageCode: "en" // Adjust if the document uses a different language
    };
  3. Preprocess Documents for Consistency
    Some documents may fail due to layout or font variations. Preprocess the PDFs to standardize formats:

    • OCR Standardization: Use tools like Tesseract to convert PDFs to searchable text.
    • Flatten PDFs: Flatten layers to simplify structure and avoid embedded fonts or hidden elements.

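    Example using Tesseract (it reads images rather than PDFs, so render the pages to images first, for example with pdftoppm):

    pdftoppm -r 300 -png input.pdf page
    tesseract page-1.png output -l eng pdf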

  4. Debug Outputs for Missing or Misclassified Blocks
    Enable detailed logging to analyze missing blocks or misclassified data. If the expected fields do not appear as QUERY_RESULT, they may have been misclassified or missed due to layout issues.

    Example Debug Logging:

    response.Blocks.forEach(block => {
        console.log(block.BlockType, block.Text, block.Query ? block.Query.Text : "N/A");
    });

    This can help identify whether the query text is not matching or the data is misclassified.
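
    To see which queries came back empty, it can help to pair each QUERY block with the QUERY_RESULT blocks it points to through its ANSWER relationship. A minimal sketch, assuming blocks is the Blocks array of a response; the function name and the sample blocks are made up for illustration and are not real Textract output:

```javascript
// Sketch: pair each QUERY block with the text of its QUERY_RESULT blocks.
function extractQueryAnswers(blocks) {
    const byId = new Map(blocks.map((block) => [block.Id, block]));
    return blocks
        .filter((block) => block.BlockType === "QUERY")
        .map((query) => {
            // A QUERY block lists its answers under an ANSWER relationship.
            const answerIds = (query.Relationships ?? [])
                .filter((rel) => rel.Type === "ANSWER")
                .flatMap((rel) => rel.Ids ?? []);
            const answers = answerIds
                .map((id) => byId.get(id))
                .filter((block) => block && block.BlockType === "QUERY_RESULT")
                .map((block) => ({ text: block.Text, confidence: block.Confidence }));
            return { query: query.Query?.Text, answers };
        });
}

// Hand-written sample blocks, only to show the expected shape.
const sampleBlocks = [
    { Id: "q1", BlockType: "QUERY", Query: { Text: "GSTIN" },
      Relationships: [{ Type: "ANSWER", Ids: ["r1"] }] },
    { Id: "r1", BlockType: "QUERY_RESULT", Text: "22AAAAA0000A1Z5", Confidence: 97 },
    { Id: "q2", BlockType: "QUERY", Query: { Text: "Invoice Number" } }
];

const queryAnswers = extractQueryAnswers(sampleBlocks);
for (const { query, answers } of queryAnswers) {
    console.log(query, answers.length ? answers : "NO ANSWER");
}
```

    A query that prints NO ANSWER was received by Textract but produced no result, which points to wording or layout rather than a missing QueriesConfig.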

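  5. Add Fallback Logic for Manual Key-Value Detection
    If Textract’s Query features still fail for certain documents, implement fallback logic to detect the key-value pairs manually. KEY_VALUE_SET blocks are only returned when FeatureTypes also includes "FORMS", and their text lives in the WORD blocks they reference rather than on the blocks themselves.

    Example:

    const byId = new Map(response.Blocks.map(block => [block.Id, block]));
    // Ids a block references through one relationship type (CHILD, VALUE, ...)
    const relatedIds = (block, type) =>
        (block?.Relationships ?? []).filter(rel => rel.Type === type).flatMap(rel => rel.Ids ?? []);
    // Join the text of a block's CHILD WORD blocks
    const textOf = block => relatedIds(block, "CHILD").map(id => byId.get(id)?.Text ?? "").join(" ");

    const keys = response.Blocks.filter(block => block.BlockType === "KEY_VALUE_SET" && block.EntityTypes?.includes("KEY"));
    keys.filter(key => textOf(key).includes("GSTIN")).forEach(key => {
        const value = relatedIds(key, "VALUE").map(id => textOf(byId.get(id))).join(" ");
        console.log("GSTIN Found:", value);
    });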

    This ensures that fields like GSTIN or Invoice Numbers are retrieved even when queries fail.
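
    A second fallback that does not depend on key-value detection at all is to scan the plain LINE blocks for text shaped like a GSTIN. This is a different technique from the key-value lookup above: a sketch, assuming the standard 15-character GSTIN layout, with a hypothetical function name and made-up sample lines:

```javascript
// GSTIN layout: 2-digit state code, 10-character PAN, entity code, "Z", check character.
const GSTIN_PATTERN = /\b\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b/g;

// Hypothetical helper: collect distinct GSTIN-shaped strings from LINE blocks.
function findGstins(blocks) {
    const found = new Set();
    for (const block of blocks) {
        if (block.BlockType !== "LINE" || !block.Text) continue;
        for (const match of block.Text.toUpperCase().matchAll(GSTIN_PATTERN)) {
            found.add(match[0]);
        }
    }
    return [...found];
}

// Hand-written sample blocks, not real Textract output.
const sampleLines = [
    { BlockType: "LINE", Text: "GSTIN : 22AAAAA0000A1Z5" },
    { BlockType: "LINE", Text: "Invoice No: INV-1042" },
    { BlockType: "WORD", Text: "22AAAAA0000A1Z5" }
];

const gstins = findGstins(sampleLines);
console.log(gstins);
```

    The pattern only checks the shape of the value, not its checksum, so treat a match as a candidate and validate it downstream.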


Closing Thoughts

By combining precise query configurations, preprocessing steps, and fallback logic, you can resolve the inconsistencies in Textract’s QUERY and QUERY_RESULT block detection. Debugging and preprocessing will play a crucial role in ensuring consistent data extraction across all your invoice formats.

Let me know if you’d like further assistance or if we should refine the solution further. I’m here to help! 😊


Farewell

I hope this enhanced response helps you resolve the issue, Sharon! If you encounter further challenges or need more code examples, feel free to reach out. Good luck automating your invoice processing! 🚀✨


Cheers,

Aaron 😊

answered a month ago
  • Hi Aaron Rose, thank you for your valuable answer. I am getting validation errors as follows: Error processing invoice: MultipleValidationErrors: There were 2 validation errors:

    • UnexpectedParameter: Unexpected key 'DocumentReadMode' found in params
    • UnexpectedParameter: Unexpected key 'LanguageCode' found in params

    I am sharing a basic sample of my code below. The file I am uploading is a two-page PDF containing bills of the same type.

    import { TextractClient } from "@aws-sdk/client-textract";
    import AWS from "aws-sdk";

    const textract = new AWS.Textract();
    const sqs = new AWS.SQS();
    const textractClient = new TextractClient({ region: process.env.AWS_REGION });

    const input = {
        FeatureTypes: ["QUERIES"],
        QueriesConfig: {
            Queries: [{ Text: "GSTIN" }, { Text: "Invoice Number" }],
        },
        // Specify document language
        DocumentReadMode: "FORMS_AND_TABLES", // Enhance OCR
        LanguageCode: "en", // Adjust if the document uses a different language
        DocumentLocation: { S3Object: { Bucket: AWS_S3_BUCKET ?? "", Name: fileKey } },
        ClientRequestToken:  Date.now(),
        NotificationChannel: {
            SNSTopicArn: AWS_SNS_TOPIC_ARN ?? "",
            RoleArn: AWS_ROLE_ARN ?? "",
        },
        JobTag: "INVOICE_PROCESS"
    

    }

Follow-Up Guidance for Textract Query Issue

Hi Sharon!

Thanks for sharing your code sample and the validation errors you're encountering. Let’s address these issues to help you move forward smoothly.


Clarifying the Errors

The UnexpectedParameter errors indicate that DocumentReadMode and LanguageCode are not valid request parameters: neither AnalyzeDocument nor StartDocumentAnalysis accepts them. These keys were incorrectly included in the input object; they are not part of the Textract API, so they simply need to be removed.

Additionally, you’re processing a multi-page PDF, so it’s important to confirm that the queries apply to both pages. Each query has an optional Pages setting, and as far as I know a query without it is applied to page 1 only, so a field that appears only on page 2 can come back without a QUERY_RESULT. Results may also vary if a query’s target content is split across pages or located in a non-standard layout.
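
One way to make the queries cover a multi-page PDF is to set Pages on every query. A minimal sketch, assuming the documented "*" wildcard for "all pages" and a hypothetical helper name withAllPages:

```javascript
// Hypothetical helper: apply every query to all pages of the document.
// "*" has to be the only entry in Pages when it is used to mean "all pages".
function withAllPages(queriesConfig) {
    return {
        ...queriesConfig,
        Queries: queriesConfig.Queries.map((query) => ({ ...query, Pages: ["*"] }))
    };
}

const allPagesConfig = withAllPages({
    Queries: [{ Text: "GSTIN" }, { Text: "Invoice Number" }]
});

console.log(JSON.stringify(allPagesConfig));
```

The returned object can be used as the QueriesConfig value of the StartDocumentAnalysis input. If each bill occupies a known page, a narrower range such as ["1-2"] keeps the number of query evaluations down.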


Adjusted Code Sample

Here’s a revised version of your code that removes unsupported parameters and improves handling for multi-page documents:

import { TextractClient, StartDocumentAnalysisCommand } from "@aws-sdk/client-textract";

const textractClient = new TextractClient({ region: process.env.AWS_REGION });

const input = {
    FeatureTypes: ["QUERIES"],
    QueriesConfig: {
        Queries: [{ Text: "GSTIN" }, { Text: "Invoice Number" }],
    },
    DocumentLocation: {
        S3Object: {
            Bucket: process.env.AWS_S3_BUCKET || "",
            Name: fileKey,
        },
    },
    ClientRequestToken: `${Date.now()}`,
    NotificationChannel: {
        SNSTopicArn: process.env.AWS_SNS_TOPIC_ARN || "",
        RoleArn: process.env.AWS_ROLE_ARN || "",
    },
    JobTag: "INVOICE_PROCESS",
};

async function startDocumentAnalysis() {
    try {
        const command = new StartDocumentAnalysisCommand(input);
        const response = await textractClient.send(command);
        console.log("Job started successfully:", response.JobId);
    } catch (error) {
        console.error("Error processing invoice:", error);
    }
}

startDocumentAnalysis();

Key Changes:

  1. Removed Invalid Parameters: Removed DocumentReadMode and LanguageCode.
  2. Correct API Command: Used StartDocumentAnalysisCommand for asynchronous job processing, as it's necessary for multi-page PDF files.
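  3. ClientRequestToken: Converted the token to a string, which the API expects. The token makes the start request idempotent, so a retried call with the same token does not launch a duplicate job.
  4. NotificationChannel: Kept the SNS topic ARN and role ARN so that Textract can publish a notification when the job completes.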
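
When the completion notification arrives, the JobId has to be read out of the message before the results can be fetched. A minimal sketch, assuming the SNS topic is delivered to an SQS queue (your sample creates an SQS client) and that the payload carries JobId, Status and JobTag fields; the helper name and the sample body are hypothetical:

```javascript
// Hypothetical helper: extract the job details from an SQS message body.
function parseTextractNotification(sqsBody) {
    const envelope = JSON.parse(sqsBody);
    // Without raw message delivery, SNS wraps the payload in a "Message" string.
    const payload = typeof envelope.Message === "string"
        ? JSON.parse(envelope.Message)
        : envelope;
    return {
        jobId: payload.JobId,
        status: payload.Status,
        jobTag: payload.JobTag,
        succeeded: payload.Status === "SUCCEEDED"
    };
}

// Made-up message body, only to show the assumed shape.
const sampleBody = JSON.stringify({
    Type: "Notification",
    Message: JSON.stringify({ JobId: "abc123", Status: "SUCCEEDED", JobTag: "INVOICE_PROCESS" })
});

const notification = parseTextractNotification(sampleBody);
console.log(notification);
```

Only call the result-fetching code when succeeded is true; any other status means there are no query results to read yet.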

Debugging Query Results

Once the job completes, retrieve the results using the GetDocumentAnalysisCommand. Ensure you check all pages for QUERY and QUERY_RESULT block types:

import { GetDocumentAnalysisCommand } from "@aws-sdk/client-textract";

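async function getJobResults(jobId) {
    try {
        // Collect the blocks from every result page before resolving relationships.
        const blocks = [];
        let nextToken;
        do {
            const command = new GetDocumentAnalysisCommand({ JobId: jobId, NextToken: nextToken });
            const response = await textractClient.send(command);
            if (response.JobStatus !== "SUCCEEDED") {
                console.log("Job not finished or failed:", response.JobStatus);
                return;
            }
            blocks.push(...(response.Blocks ?? []));
            // Pagination continues with the same JobId plus the returned NextToken.
            nextToken = response.NextToken;
        } while (nextToken);

        const byId = new Map(blocks.map((block) => [block.Id, block]));

        // The question text is on the QUERY block; its ANSWER relationship
        // points to the QUERY_RESULT blocks that hold the extracted text.
        blocks.filter((block) => block.BlockType === "QUERY").forEach((query) => {
            const answers = (query.Relationships ?? [])
                .filter((rel) => rel.Type === "ANSWER")
                .flatMap((rel) => rel.Ids ?? [])
                .map((id) => byId.get(id)?.Text);
            console.log(`Query: ${query.Query?.Text} (page ${query.Page})`);
            console.log(`Result: ${answers.length ? answers.join(", ") : "no answer"}`);
        });
    } catch (error) {
        console.error("Error retrieving job results:", error);
    }
}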

Steps:

  1. Use the returned JobId to fetch results.
  2. Iterate through all pages and capture QUERY and QUERY_RESULT blocks.
  3. Implement pagination handling with the NextToken.

Improving Results for Specific Fields

If GSTIN or Invoice Number queries still fail for some files:

  • Refine Query Text: Use exact field labels or synonyms found in your documents (e.g., “GST Number” instead of “GSTIN”).
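  • Document Preprocessing: Convert PDFs to a simpler format using tools like Tesseract to standardize fonts, layers, and layouts. Tesseract reads images rather than PDFs, so render the pages to images first, for example with pdftoppm:
    pdftoppm -r 300 -png input.pdf page
    tesseract page-1.png output -l eng pdf
  • Fallback Logic: Use the KEY_VALUE_SET blocks (returned when FeatureTypes also includes "FORMS") to manually extract GSTIN or Invoice Numbers when queries don’t produce results.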

Closing Thoughts

These adjustments should resolve the validation errors and improve consistency in detecting QUERY_RESULT blocks. If you still encounter issues, consider preprocessing problematic PDFs or testing the Textract service with fewer features (e.g., disable tables or forms) to isolate potential layout-related problems.

Let me know how it goes or if you’d like further clarification on any part of this solution! 😊🚀


Cheers,

Aaron 😊

answered a month ago
