The issue you're experiencing with Amazon Textract not properly reading line items on page 2 of your invoice PDF could be due to a few factors:
- Asynchronous vs. Synchronous Processing: Textract's synchronous operations, such as AnalyzeDocument, accept only single-page documents. Multi-page PDFs must go through the asynchronous operations, such as StartDocumentAnalysis. Make sure you're using the appropriate API for multi-page documents.
- Table Complexity: Textract may struggle with complex table structures, merged cells, or unusual formatting. If the table on page 2 has a different layout or is more complex than the one on page 1, this could cause issues.
- Image Quality: The quality of the scanned document affects how accurately Textract extracts data. Ensure that both pages have good resolution, are properly aligned, and have sufficient contrast.
- Table Borders: If the table on page 2 lacks clear borders or has faint lines, Textract might not recognize it as a complete table structure.
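For the first point, here is a minimal sketch of reading the results of an asynchronous job with boto3. It assumes the job was started with `start_document_analysis` and that `client` is a boto3 Textract client; the helper names `fetch_all_blocks` and `tables_per_page` are made up for this example. Results from `get_document_analysis` come back in pages linked by `NextToken`, so a loop that stops after the first response can silently drop the blocks for later pages.

```python
import time
from collections import Counter


def fetch_all_blocks(client, job_id, poll_seconds=5.0):
    """Poll an asynchronous Textract job, then collect blocks from every result page."""
    # Wait until the job leaves the IN_PROGRESS state.
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job ended with status {status}")

    blocks = list(response.get("Blocks", []))
    # Results are paginated: keep following NextToken until it is absent.
    while "NextToken" in response:
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


def tables_per_page(blocks):
    """Count TABLE blocks per page so a page with no detected table stands out."""
    return Counter(b.get("Page", 1) for b in blocks if b["BlockType"] == "TABLE")


# Usage (needs boto3, AWS credentials, and a PDF already in S3;
# the bucket and file names below are placeholders):
#   import boto3
#   client = boto3.client("textract")
#   job = client.start_document_analysis(
#       DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "invoice.pdf"}},
#       FeatureTypes=["TABLES"],
#   )
#   blocks = fetch_all_blocks(client, job["JobId"])
#   print(tables_per_page(blocks))
```

If page 2 is missing from the counts, Textract did not detect a table there, which points to the layout or quality factors rather than the API choice.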
To address this issue:
- Verify that you're using the asynchronous API for multi-page document analysis.
- Check the quality and formatting of the second page, ensuring it's consistent with the first page.
- If possible, improve the contrast of the table borders and text on the second page.
- Consider implementing additional post-processing logic to handle cases where Textract fails to extract all the data correctly.
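As one possible shape for that post-processing step, the sketch below rebuilds a table from the blocks in a Textract response and flags rows with empty cells so they can be reviewed or re-extracted. It relies on the documented block structure (TABLE blocks point to CELL blocks, and CELL blocks to WORD blocks, through CHILD relationships); the function names `table_rows` and `incomplete_rows` are hypothetical, and treating any empty cell as a problem is an assumption that may be too strict for invoices with optional columns.

```python
def table_rows(blocks, table):
    """Return one TABLE block as a list of rows, each a list of cell strings."""
    by_id = {b["Id"]: b for b in blocks}

    def children(block):
        # Follow CHILD relationships to the blocks they reference.
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    if child_id in by_id:
                        yield by_id[child_id]

    grid = {}
    for cell in children(table):
        if cell["BlockType"] != "CELL":
            continue
        words = [w["Text"] for w in children(cell) if w["BlockType"] == "WORD"]
        grid[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    if not grid:
        return []
    n_rows = max(r for r, _ in grid)
    n_cols = max(c for _, c in grid)
    # RowIndex and ColumnIndex are 1-based in Textract responses.
    return [
        [grid.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def incomplete_rows(rows):
    """Return 1-based indexes of rows that contain at least one empty cell."""
    return [
        i for i, row in enumerate(rows, start=1)
        if any(not cell.strip() for cell in row)
    ]
```

Running `table_rows` on each TABLE block from page 2 and passing the result to `incomplete_rows` gives a list of line items to check by hand or to retry with a cleaner scan.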
If the problem persists after trying these solutions, you may need to contact AWS support for further assistance, as they can provide more specific guidance based on your particular use case and document structure.
Sources
Textract Error With Multiples Pages | AWS re:Post
AWS Textract Missing Crucial Data in a Table | AWS re:Post