Amazon Textract not reading line item tables from page 2 of an invoice PDF

I have a two-page invoice from a customer and I am using the expense analysis API from Textract to read it. However, the API does not return the line items on page 2 of the PDF in the proper structure.

The line items table on page 2 is similar to the one on page 1, and the PDF is of very good quality.

I am using StartExpenseAnalysisRequest in async mode because it is a multi-page PDF.

I am not able to attach the PDF here because it belongs to a client.

asked 2 months ago, 29 views
1 Answer
The issue you're experiencing with Amazon Textract not properly reading line items on page 2 of your invoice PDF could be due to a few factors:

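  1. Asynchronous vs. Synchronous Processing: Textract's synchronous operations, such as AnalyzeExpense, only support single-page documents. Multi-page PDFs require an asynchronous operation. For expense analysis that is StartExpenseAnalysis, with results retrieved through GetExpenseAnalysis (StartDocumentAnalysis is the counterpart for general table and form extraction). Asynchronous results can be split across several responses, so make sure you follow NextToken until everything has been read.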

  2. Table Complexity: Textract may struggle with complex table structures, merged cells, or unusual formatting. If the table on page 2 has a different layout or complexity compared to page 1, this could cause issues.

  3. Image Quality: The quality of the scanned document can affect Textract's ability to accurately extract data. Ensure that both pages have good resolution, are properly aligned, and have sufficient contrast.

  4. Table Borders: If the table on page 2 lacks clear borders or has faint lines, Textract might not recognize it as a complete table structure.
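
On the first point, one thing worth ruling out is that part of the result was never fetched. GetExpenseAnalysis can return its output in several parts linked by NextToken. Below is a minimal Python sketch of polling the job and then following NextToken to the end. The function name `fetch_expense_documents`, the polling interval, and the poll limit are illustrative assumptions. `client` is assumed to be a Textract client such as `boto3.client("textract")`, passed in by the caller.

```python
import time


def fetch_expense_documents(client, job_id, poll_seconds=5.0, max_polls=120):
    """Wait for an expense analysis job, then collect every ExpenseDocument.

    `client` is assumed to be a Textract client (for example
    boto3.client("textract")); `job_id` is the JobId returned by
    StartExpenseAnalysis.
    """
    # Poll until the job is no longer IN_PROGRESS.
    for _ in range(max_polls):
        response = client.get_expense_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} still in progress after {max_polls} polls")

    if status == "FAILED":
        raise RuntimeError(response.get("StatusMessage", "Expense analysis failed"))

    # The first response may hold only part of the result; keep requesting
    # until no NextToken is returned.
    documents = list(response.get("ExpenseDocuments", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_expense_analysis(JobId=job_id, NextToken=next_token)
        documents.extend(response.get("ExpenseDocuments", []))
        next_token = response.get("NextToken")
    return documents
```

If the combined list still lacks the page 2 line items, the gap is in the extraction itself rather than in how the results were retrieved.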

To address this issue:

  1. Verify that you're using the asynchronous API for multi-page document analysis.
  2. Check the quality and formatting of the second page, ensuring it's consistent with the first page.
  3. If possible, improve the contrast of the table borders and text on the second page.
  4. Consider implementing additional post-processing logic to handle cases where Textract fails to extract all the data correctly.
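
As a starting point for the post-processing step, it can help to see which line items Textract attributed to which page. The sketch below groups line items by the `PageNumber` reported on their fields. It assumes `expense_documents` is the list of `ExpenseDocuments` entries collected from GetExpenseAnalysis. The function name `line_items_by_page` and the row layout are illustrative choices, not part of the Textract API.

```python
from collections import defaultdict


def line_items_by_page(expense_documents):
    """Group extracted line items by the page their fields were found on.

    Each line item becomes a dict mapping the field type reported by
    Textract (for example ITEM, QUANTITY, PRICE) to the detected text.
    If a type occurs more than once in a line item, the last value wins.
    """
    pages = defaultdict(list)
    for document in expense_documents:
        for group in document.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                row = {}
                page = None
                for field in item.get("LineItemExpenseFields", []):
                    field_type = field.get("Type", {}).get("Text", "OTHER")
                    row[field_type] = field.get("ValueDetection", {}).get("Text", "")
                    # Use the first page number seen for this line item.
                    if page is None:
                        page = field.get("PageNumber")
                pages[page].append(row)
    return dict(pages)
```

A result with no entry for page 2, or with page 2 rows that lack fields such as ITEM or PRICE, shows exactly where the structure breaks down. That is concrete detail to include in a support case.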

If the problem persists after trying these solutions, you may need to contact AWS support for further assistance, as they can provide more specific guidance based on your particular use case and document structure.
Sources
Textract Error With Multiples Pages | AWS re:Post
AWS Textract Missing Crucial Data in a Table | AWS re:Post

answered 2 months ago
