Use Textract like traditional OCR software to recognize scanned pages of long texts while retaining the formatting?

0

I'm completely new to Textract, and before taking the plunge of learning the API, I wanted to ask: is it possible to use Textract to recognize scanned pages, such as books or scholarly articles, while retaining the character and paragraph formatting, and have it output an RTF or .DOC text file? Many thanks!

asked a year ago, 373 views
2 answers
1
Accepted answer

By formatting, I assume you mean font size and style (e.g. bold, italic)? Currently Textract does not extract information on this type of formatting.

The DetectDocumentText API currently provides the following information (source):

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page
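As a rough illustration of working with that output, here is a minimal sketch that turns the LINE blocks into plain text in reading order, using only the page number and bounding-box position listed above. The helper names `lines_by_page` and `detect_text` and the hand-written `sample_response` are hypothetical; the field names (`Blocks`, `BlockType`, `Text`, `Page`, `Geometry.BoundingBox`) follow the Textract response format. Sorting by top and then left is a simplification that assumes a single-column page, and no font or style information is available to preserve.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group LINE blocks by page, sorted top-to-bottom, then left-to-right."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        # Fall back to page 1 if the Page field is missing.
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))
    return {
        page: [text for _, _, text in sorted(items)]
        for page, items in sorted(pages.items())
    }


def detect_text(image_path):
    """Call Textract on a local image (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so lines_by_page works without AWS access

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        return client.detect_document_text(Document={"Bytes": f.read()})


# Hypothetical, hand-written fragment shaped like a Textract response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {
            "BlockType": "LINE",
            "Page": 1,
            "Text": "Second line",
            "Geometry": {"BoundingBox": {"Top": 0.20, "Left": 0.1, "Width": 0.5, "Height": 0.03}},
        },
        {
            "BlockType": "LINE",
            "Page": 1,
            "Text": "First line",
            "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.1, "Width": 0.5, "Height": 0.03}},
        },
        {
            "BlockType": "WORD",
            "Page": 1,
            "Text": "First",
            "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.1, "Width": 0.2, "Height": 0.03}},
        },
    ]
}

page_lines = lines_by_page(sample_response)
```

With a real scan, `lines_by_page(detect_text("page.png"))` would give one list of text lines per page, which could then be written to a plain text file. The file name here is only a placeholder.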

It can also extract tables, forms, and specific information through queries. This page provides a good overview of the output you can expect.
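For reference, tables, forms, and queries are requested through the AnalyzeDocument operation rather than the text-detection one. The sketch below only builds the request parameters; `build_analyze_request` is a hypothetical helper and the query text is an invented example, while `FeatureTypes` and `QueriesConfig` are the parameter names used by boto3's `analyze_document`.

```python
def build_analyze_request(document_bytes, queries=None):
    """Build keyword arguments for Textract's AnalyzeDocument operation."""
    features = ["TABLES", "FORMS"]
    request = {"Document": {"Bytes": document_bytes}, "FeatureTypes": features}
    if queries:
        # Queries are natural-language questions asked of the document.
        features.append("QUERIES")
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    return request


# Placeholder bytes and an invented question, for illustration only.
example_request = build_analyze_request(b"placeholder", ["Who is the author?"])

# With boto3 and AWS credentials available, the call would look like:
#   import boto3
#   response = boto3.client("textract").analyze_document(**example_request)
```

The response uses the same block structure as text detection, with additional block types for tables, key-value pairs, and query results.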

AWS
S_Moose
answered a year ago
0

Thank you very much for your explanation! Given that Textract has very high accuracy in terms of correctly recognizing the characters, this would be a great feature to add.

answered a year ago
