Textract AnalyzeDocument API doesn't return accurate table information


I am trying to extract and parse an invoice PDF that contains tabular data, using Python. A few of the table's columns have header cells with a rowspan of 2. Textract does not extract the text of such cells correctly: in the response, it reports the rowspan as 1. So if a column has the header "Description of items", it returns "Description of" as one row and "items" as another row.

Sample PDF input

Output is below.

Table[0][2] = Designer
Table[1][2] = Code

Table[0][1] = of
Table[1][1] = Description Goods
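
For anyone trying to reproduce this, here is a minimal sketch of how a Table[row][col] grid like the one above can be built from the response, plus one possible workaround that joins the first two rows into a single header. It assumes `blocks` is the `Blocks` list of an AnalyzeDocument response requested with the TABLES feature. `table_grids` and `merge_header_rows` are hypothetical helper names, and treating the first two rows as one header is my own assumption, not something Textract does.

```python
def table_grids(blocks):
    """Return one {(row, col): text} dict per TABLE block, with 0-based indices."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Collect the Ids of all CHILD relationships (absent on leaf blocks).
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    grids = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        grid = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[i]["Text"]
                for i in child_ids(cell)
                if by_id[i]["BlockType"] == "WORD"
            ]
            # Textract's RowIndex and ColumnIndex are 1-based.
            grid[(cell["RowIndex"] - 1, cell["ColumnIndex"] - 1)] = " ".join(words)
        grids.append(grid)
    return grids


def merge_header_rows(grid, header_rows=2):
    """Workaround (assumption): join the first `header_rows` rows per column."""
    n_cols = max(col for _, col in grid) + 1
    header = []
    for col in range(n_cols):
        parts = [grid.get((row, col), "") for row in range(header_rows)]
        header.append(" ".join(p for p in parts if p))
    return header
```

With the output above, joining the rows would give "Designer Code" for column 2 but "of Description Goods" for column 1, so simple concatenation does not restore the original word order.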

Has anyone faced or solved this issue? Please suggest a solution.


asked a year ago, 319 views