Textract: Layout response objects are misclassified and not returned in the desired order

Background: I am using the Textract AnalyzeDocument API to detect Layout response objects in a PDF page. The page has page headers, a title, sub-headers, tables, figures, and text. It is divided into 3 vertical columns, each containing some text and tables.

Challenge: I have 2 challenges:

  1. With the Layout option of the AnalyzeDocument API, Textract correctly identifies about 90% of the response objects. Some sub-headers are identified as Text, and some are identified as part of a Table. How can I train my model to identify the response objects correctly?
  2. The order in which these Layout response objects are returned is completely wrong. For example, I want all the response objects of Column 1 to be presented first, followed by those of Column 2, and so on. Is there a way to train Textract to identify and print the objects from Column 1 first, followed by those from Column 2?

I am attaching some snippets to better understand my challenges:

asked 22 days ago, 120 views
1 Answer
Accepted Answer

Using bounding boxes might be helpful. You could try the Textractor package (amazon-textract-overlayer).
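
As a minimal sketch of the bounding-box idea, assuming the AnalyzeDocument response has already been loaded as a Python dict: layout elements appear in `Blocks` with a `BlockType` that starts with `LAYOUT_`, and each carries a `Geometry.BoundingBox` whose `Left`, `Top`, `Width`, and `Height` are ratios of the page size. The helper name `layout_boxes` and the sample response are made up for illustration.

```python
def layout_boxes(response):
    """Return (block_type, x_min, y_min, x_max, y_max) for each layout block."""
    boxes = []
    for block in response.get("Blocks", []):
        block_type = block.get("BlockType", "")
        # Keep only layout elements; skip PAGE, LINE, WORD, and other block types.
        if not block_type.startswith("LAYOUT_"):
            continue
        bb = block["Geometry"]["BoundingBox"]
        x_min, y_min = bb["Left"], bb["Top"]
        boxes.append(
            (block_type, x_min, y_min, x_min + bb["Width"], y_min + bb["Height"])
        )
    return boxes


# Hypothetical, heavily trimmed response used only to exercise the helper.
sample_response = {
    "Blocks": [
        {"BlockType": "LINE",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.02, "Width": 0.4, "Height": 0.02}}},
        {"BlockType": "LAYOUT_TITLE",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.02, "Width": 0.9, "Height": 0.04}}},
        {"BlockType": "LAYOUT_TEXT",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.12, "Width": 0.28, "Height": 0.3}}},
    ]
}

for box in layout_boxes(sample_response):
    print(box)
```

The x-min and y-min values returned here are the inputs you would need for any column-ordering logic.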

AWS
JoeWil
answered 21 days ago
EXPERT
reviewed 21 days ago
  • Thanks for your answer. Yes, I have been trying that: using bounding boxes to get the x-min and y-min of each response object and then trying to devise a way to order them. The challenge is that, even with the x-min coordinate, I cannot tell which response objects fall in Column 1, Column 2, or Column 3 of the page. In the output, I have to list all the objects of Column 1 first, in order of increasing y-min, followed by those of Column 2, and so on. Is there any method or algorithm you can think of that would help me achieve this?
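
One possible approach to the ordering described above, offered as a sketch under stated assumptions rather than a tested solution: sort the blocks by x-min, start a new column wherever x-min jumps by more than some gap, then sort each column by y-min. The function name `order_by_columns`, the `gap` and `full_width` thresholds, and the sample blocks are all hypothetical and would need tuning for a real page.

```python
def order_by_columns(blocks, gap=0.1, full_width=0.6):
    """Order blocks column by column, top to bottom within each column.

    blocks: dicts with 'Left', 'Top', 'Width' as ratios of the page size.
    gap: minimum jump in x-min that starts a new column (assumed value).
    full_width: blocks at least this wide are treated as spanning the
        page, e.g. a title or page header (assumed value).
    """
    spanning = [b for b in blocks if b["Width"] >= full_width]
    narrow = sorted(
        (b for b in blocks if b["Width"] < full_width), key=lambda b: b["Left"]
    )

    columns = []
    for b in narrow:
        # A jump in x-min larger than `gap` relative to the previous block
        # (in x-min order) is taken as the start of the next column.
        if not columns or b["Left"] - columns[-1][-1]["Left"] > gap:
            columns.append([])
        columns[-1].append(b)

    # Page-spanning blocks come first, then each column from top to bottom.
    ordered = sorted(spanning, key=lambda b: b["Top"])
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["Top"]))
    return ordered


# Hypothetical blocks for a three-column page with a full-width title.
sample_blocks = [
    {"Id": "c2-top", "Left": 0.36, "Top": 0.12, "Width": 0.28},
    {"Id": "c1-bottom", "Left": 0.05, "Top": 0.50, "Width": 0.28},
    {"Id": "title", "Left": 0.05, "Top": 0.02, "Width": 0.90},
    {"Id": "c3-top", "Left": 0.68, "Top": 0.12, "Width": 0.28},
    {"Id": "c1-top", "Left": 0.06, "Top": 0.12, "Width": 0.27},
    {"Id": "c2-bottom", "Left": 0.37, "Top": 0.55, "Width": 0.27},
]

print([b["Id"] for b in order_by_columns(sample_blocks)])
```

Grouping on jumps in x-min avoids hard-coding column boundaries, but it assumes the left edges within a column are roughly aligned. Page-spanning blocks are simply placed first here, so a full-width footer would need separate handling.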
