
Textract Analyzes Same PDF Differently: Page Missing in CLI/Boto3 but Present in Demo


Hi AWS Support,

I'm encountering a problem when using Textract's StartDocumentAnalysis API with the LAYOUT and TABLES feature types on a 4-page PDF document stored in S3. Textract does not extract the PDF correctly, and the results vary significantly depending on the method used (boto3, CLI, or the console demo).

Issue Summary

  • Using boto3 SDK:

    • Pages 1 and 2: Fully returned and correct.
    • Page 3: Partially returned, content is truncated.
    • Page 4: Missing entirely.
  • Using AWS CLI:

    • Pages 1–3: Fully returned, including complete page 3.
    • Page 4: Missing entirely.
  • Using the AWS Console Textract Demo (same region and document):

    • All 4 pages are processed correctly with full layout and table extraction.
  • Using single-page PDFs (split version of the same document):

    • All pages are processed correctly regardless of method.

Technical Setup

  • Python Version: 3.9.6
  • Boto3 Version: 1.38.33
  • Region: eu-central-1
  • AWS CLI command:
aws textract start-document-analysis   --document-location '{"S3Object":{"Bucket":"bucket_name","Name":"file_name"}}'   --feature-types '["TABLES","LAYOUT"]'   --region eu-central-1
  • Python functions to extract the layout JSON:
import time


def start_job(client, bucket_name, file_name):
    # Start an asynchronous analysis job for a document stored in S3.
    response = client.start_document_analysis(
        DocumentLocation={
            "S3Object": {
                "Bucket": bucket_name,
                "Name": file_name,
            }
        },
        FeatureTypes=[
            "TABLES",
            "LAYOUT",
        ],
    )

    job_id = response["JobId"]
    return job_id


def get_job_results(client, job_id):
    # Collect every paginated response for the job.
    # Note: JobStatus is not checked before the results are read.
    pages = []
    time.sleep(5)
    response = client.get_document_analysis(JobId=job_id)
    pages.append(response)
    next_token = response.get("NextToken")

    while next_token:
        time.sleep(5)
        response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        pages.append(response)
        next_token = response.get("NextToken")

    return pages

Could this be a backend processing limit or bug affecting documents with LAYOUT feature in certain regions or versions? Is there a recommended approach for ensuring consistent, full-document processing across methods?

Please let me know if any additional logs or the document (sanitized) can help further isolate the issue.

Thanks in advance for your support!

Asked 5 months ago, viewed 53 times
1 Answer

The inconsistent Textract results you're seeing across different methods appear to be related to how the asynchronous results for a multi-page document are retrieved and assembled in each case.

Based on your description, there are a few potential causes and solutions to consider:

  1. Pagination handling: When using the asynchronous API (StartDocumentAnalysis), you need to handle pagination of the results correctly. Your code follows NextToken, which is good, but it calls GetDocumentAnalysis after a fixed 5-second sleep without checking JobStatus, so results may be read before the job has finished. Pages can also be lost in how the paginated responses are processed or combined afterwards.

  2. Response size limitations: GetDocumentAnalysis returns at most 1,000 blocks per call, so the output for a dense document spans several responses. Reading only some of them could explain why page 3 is truncated in the boto3 implementation. The console demo might handle these limits differently than your direct API calls.

  3. Document complexity: If pages 3 and 4 contain complex tables or layouts, they might be more challenging for Textract to process, leading to inconsistent results across different methods.

  4. API version differences: The console demo might be using a newer version of the Textract service than what's available through your boto3 version.
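One way to tell these causes apart is to count how many blocks were actually returned for each page and compare that with the page count Textract reports. The following is a minimal sketch, not an official diagnostic: `summarize_pages` is a hypothetical helper name, and it assumes `responses` is the list returned by the `get_job_results` function in the question. It relies on the `DocumentMetadata.Pages` field and the per-block `Page` field of the GetDocumentAnalysis response.

```python
from collections import Counter


def summarize_pages(responses):
    """Count blocks per page across a list of GetDocumentAnalysis responses.

    Hypothetical helper for diagnosis only.
    """
    blocks_per_page = Counter()
    expected_pages = None
    for response in responses:
        # DocumentMetadata.Pages is the page count Textract reports for the job.
        metadata = response.get("DocumentMetadata") or {}
        expected_pages = metadata.get("Pages", expected_pages)
        for block in response.get("Blocks", []):
            # Blocks from asynchronous operations carry a 1-based Page number.
            blocks_per_page[block.get("Page", 1)] += 1

    missing = []
    if expected_pages is not None:
        missing = [
            page
            for page in range(1, expected_pages + 1)
            if blocks_per_page[page] == 0
        ]
    return {
        "expected_pages": expected_pages,
        "blocks_per_page": dict(sorted(blocks_per_page.items())),
        "missing_pages": missing,
    }
```

If `missing_pages` is empty but the final output still lacks a page, the loss is more likely in the code that combines the responses than in Textract itself.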

Recommendations:

  1. Ensure complete pagination handling: In your get_job_results function, wait until JobStatus is SUCCEEDED before reading results, then keep following NextToken until the response no longer contains one.

  2. Check document quality: Ensure the PDF is of good quality, properly aligned, and has clear text and table borders.

  3. Try updating boto3: Update to the latest version of boto3 to ensure you're using the most recent API implementations.

  4. Consider document splitting: Since you mentioned that processing individual pages works correctly, you might want to implement a workflow that splits the document and processes each page individually, then combines the results.

  5. Check for region-specific issues: Try processing the document in a different AWS region to see if the issue persists.
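The first recommendation can be sketched as follows. This is an illustrative version under stated assumptions, not a confirmed fix for this document: `wait_and_collect`, `poll_seconds`, and `timeout_seconds` are hypothetical names, and `client` is assumed to be a boto3 Textract client such as `boto3.client("textract", region_name="eu-central-1")`. It polls `get_document_analysis` until `JobStatus` leaves `IN_PROGRESS`, then follows `NextToken` to the end.

```python
import time


def wait_and_collect(client, job_id, poll_seconds=5, timeout_seconds=600):
    """Wait for a Textract analysis job to finish, then fetch every response.

    Hypothetical helper; the timeout and poll interval are arbitrary defaults.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_id} still IN_PROGRESS after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)

    if status == "FAILED":
        raise RuntimeError(f"Job {job_id} failed: {response.get('StatusMessage')}")
    # PARTIAL_SUCCESS is returned as-is; the caller should inspect "Warnings".

    # The job is finished, so no sleep is needed between paginated calls.
    responses = [response]
    while response.get("NextToken"):
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        responses.append(response)
    return responses
```

The return value has the same shape as the list produced by `get_job_results` in the question, so downstream code should not need to change.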

If the problem continues, you might want to contact AWS Support directly with the specific document to help them investigate any potential backend issues or limitations in the Textract service.
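For the document-splitting workaround in recommendation 4, the per-page results need to be recombined with the original page numbers, because each single-page job reports its blocks as page 1. The sketch below shows one way to do that. `merge_single_page_results` is a hypothetical helper, and it assumes the input is ordered by original page, with one list of GetDocumentAnalysis responses per single-page PDF.

```python
import copy


def merge_single_page_results(per_page_responses):
    """Combine results of single-page jobs into one list of blocks.

    per_page_responses: list where item i holds the responses for the
    single-page PDF that was page i + 1 of the original document.
    Hypothetical helper for illustration.
    """
    merged = []
    for page_number, responses in enumerate(per_page_responses, start=1):
        for response in responses:
            for block in response.get("Blocks", []):
                # Copy so the original responses are left unmodified.
                renumbered = copy.deepcopy(block)
                # Restore the page number from the original document.
                renumbered["Page"] = page_number
                merged.append(renumbered)
    return merged
```

Splitting the PDF itself is not shown here, since it depends on a PDF library of your choice.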
Sources
Textract Error With Multiples Pages | AWS re:Post
AWS Textract Missing Crucial Data in a Table | AWS re:Post

Answered 5 months ago
