Can Textract be used like traditional OCR software to recognize scanned pages of long texts while retaining the formatting?


I'm completely new to Textract, and before taking the plunge of learning the API, I wanted to ask whether it is possible to use Textract to recognize scanned pages, such as books or scholarly articles, while retaining the character and paragraph formatting, and have it output an RTF or .DOC text file. Many thanks!

asked a year ago, 357 views
2 Answers
Accepted Answer

By formatting, I assume you mean font size and style (e.g. bold, italic)? Currently Textract does not extract information on this type of formatting.

According to the Textract documentation, the DetectDocumentText API currently provides the following information:

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page
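To make the list above concrete, here is a minimal Python sketch that reads those four pieces of information out of a DetectDocumentText response. It assumes the boto3 SDK and configured AWS credentials; the helper names `summarize_lines` and `detect_lines` are hypothetical, chosen only for illustration. Note that nothing in the response describes font size, bold, or italic.

```python
def summarize_lines(response):
    """Collect text, page, child words, and location for each LINE block.

    `response` is the dict returned by Textract's DetectDocumentText.
    """
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks if "Id" in block}

    lines = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue

        # LINE -> WORD relationships are expressed as CHILD id lists.
        words = []
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "CHILD":
                continue
            for child_id in rel.get("Ids", []):
                child = by_id.get(child_id, {})
                if child.get("BlockType") == "WORD":
                    words.append(child.get("Text", ""))

        lines.append({
            "page": block.get("Page", 1),  # assumption: default to page 1 if absent
            "text": block.get("Text", ""),
            "words": words,
            # BoundingBox values are ratios of the page width and height.
            "box": block.get("Geometry", {}).get("BoundingBox", {}),
        })
    return lines


def detect_lines(image_path, region_name=None):
    """Send one scanned page image to Textract and summarize its lines."""
    # Imported here so summarize_lines can be used without boto3 installed.
    import boto3

    client = boto3.client("textract", region_name=region_name)
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return summarize_lines(response)
```

Calling `detect_lines("page.png")` (a hypothetical file name) would return one dictionary per detected line, which you could then write out as plain text. Any paragraph or character formatting would have to be inferred from the bounding boxes yourself.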

Textract can also extract tables, forms, and specific information through queries. The Textract documentation provides a good overview of the output you can expect.
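As a rough illustration of the tables, forms, and queries extraction mentioned above, the following Python sketch builds a request for Textract's AnalyzeDocument API. It assumes boto3 and configured AWS credentials; `build_analyze_request` and `analyze` are hypothetical helper names, and the question text you pass in is up to you.

```python
def build_analyze_request(image_bytes, questions=()):
    """Build keyword arguments for AnalyzeDocument.

    Tables and forms are always requested; queries are added only when
    at least one natural-language question is supplied.
    """
    feature_types = ["TABLES", "FORMS"]
    request = {
        "Document": {"Bytes": image_bytes},
        "FeatureTypes": feature_types,
    }
    if questions:
        feature_types.append("QUERIES")
        request["QueriesConfig"] = {
            "Queries": [{"Text": question} for question in questions]
        }
    return request


def analyze(image_path, questions=()):
    """Run AnalyzeDocument on one scanned page image."""
    # Imported here so build_analyze_request works without boto3 installed.
    import boto3

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        request = build_analyze_request(f.read(), questions)
    return client.analyze_document(**request)
```

The response uses the same `Blocks` structure as plain text detection, with additional block types for tables, key-value pairs, and query results. It still carries no font or style information.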

AWS
S_Moose
answered a year ago

Thank you very much for your explanation! Given that Textract is highly accurate at recognizing characters, extracting formatting would be a great feature to add.

answered a year ago
