Textract: incorrect Layout response objects, and results not in the desired order

Background: I am using the Textract AnalyzeDocument API to detect Layout response objects in a PDF page. The page has page headers, a title, sub-headers, tables, figures, and text. It is divided into 3 vertical columns, each containing some text and tables.

Challenge: I have 2 challenges:

  1. With the Layout option of the AnalyzeDocument API, Textract correctly identifies about 90% of the response objects. Some sub-headers are identified as Text, and sometimes sub-headers are identified as part of a Table. How can I train the model to identify the response objects correctly?
  2. The order in which the Layout response objects are returned is completely wrong. E.g., I want all the response objects of column 1 to be presented first, followed by those of column 2, and so on. Is there a way to train Textract to first identify and print the objects from column 1, followed by those from column 2?

I am attaching some snippets to better understand my challenges:

[Two images attached in the original post]

asked a month ago, 147 views
1 answer
Accepted answer

Using bounding boxes might be helpful. You should try the Textractor Package (amazon-textract-overlayer)
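
As an illustration of the bounding-box idea, here is a minimal sketch in plain Python that pulls the Layout blocks and their bounding boxes out of an AnalyzeDocument response that has already been loaded as a dict. It assumes the usual response shape (a `Blocks` list whose Layout entries have a `BlockType` starting with `LAYOUT_` and a `Geometry.BoundingBox` with page-relative `Left`, `Top`, `Width`, `Height`). The function name `layout_boxes` and the default of page 1 when `Page` is missing are assumptions made for this example, not part of any AWS library.

```python
def layout_boxes(response, page=1):
    """Collect (block_type, left, top, width, height) for LAYOUT_* blocks on one page.

    `response` is assumed to be the dict returned by AnalyzeDocument with the
    LAYOUT feature enabled. Coordinates are ratios of the page size (0..1).
    """
    boxes = []
    for block in response.get("Blocks", []):
        # Keep only Layout blocks such as LAYOUT_TEXT or LAYOUT_TABLE.
        if not block.get("BlockType", "").startswith("LAYOUT_"):
            continue
        # Assumption: treat a missing Page attribute as page 1.
        if block.get("Page", 1) != page:
            continue
        bb = block["Geometry"]["BoundingBox"]
        boxes.append(
            (block["BlockType"], bb["Left"], bb["Top"], bb["Width"], bb["Height"])
        )
    return boxes
```

The resulting tuples can then be sorted or grouped by `left` and `top` to build a custom reading order.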

AWS
JoeWil
answered a month ago
EXPERT
reviewed a month ago
  • Thanks for your answer. Yes, I have been trying that: using bounding boxes to get the x-min and y-min of the response objects and then trying to devise a way to order them. The challenge is that, even with the x-min coordinate, I cannot tell which response objects fall in column 1, column 2, or column 3 of the page. In the output, I have to list all the objects of column 1 first, in order of increasing y-min, followed by those of column 2, and so on. Is there any approach or algorithm you can think of to help me achieve this?
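
One possible way to approach the column ordering described in this comment is to group blocks by horizontal overlap instead of comparing raw x-min values. The sketch below is not a Textract feature; it is a plain-Python heuristic under stated assumptions: blocks are Textract-style dicts with `Geometry.BoundingBox`, anything wider than `span_width` (a hypothetical threshold, here half the page) is treated as spanning all columns, and the gutters between columns are clean, meaning no narrow block overlaps two columns. The function name `order_by_columns` is made up for this example.

```python
def order_by_columns(blocks, span_width=0.5):
    """Return blocks in column-major reading order (a heuristic sketch).

    Narrow blocks are sorted by left edge and merged into columns whenever
    their horizontal extents overlap. Each column is then sorted by top edge.
    """
    def box(b):
        return b["Geometry"]["BoundingBox"]

    def top(b):
        return box(b)["Top"]

    # Blocks wider than span_width (page header, title) are set aside so
    # they do not glue all columns together during the merge.
    wide = sorted((b for b in blocks if box(b)["Width"] > span_width), key=top)
    narrow = sorted(
        (b for b in blocks if box(b)["Width"] <= span_width),
        key=lambda b: box(b)["Left"],
    )

    columns = []
    right_edge = None  # right-most x reached by the current column
    for b in narrow:
        left = box(b)["Left"]
        right = left + box(b)["Width"]
        if right_edge is None or left >= right_edge:
            columns.append([b])      # starts past the gutter: new column
            right_edge = right
        else:
            columns[-1].append(b)    # overlaps the current column
            right_edge = max(right_edge, right)

    # Wide blocks above the column content go first, the rest go last.
    body_top = min((top(b) for b in narrow), default=0.0)
    ordered = [b for b in wide if top(b) < body_top]
    for column in columns:
        ordered.extend(sorted(column, key=top))
    ordered.extend(b for b in wide if top(b) >= body_top)
    return ordered
```

Because a block joins a column when it starts before that column's right edge, an indented sub-header or a short table still lands in the right column even though its x-min differs from its neighbours. If the bounding boxes on a real page overlap slightly across the gutter, a small tolerance would need to be added to the `left >= right_edge` comparison.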
