Use Textract like traditional OCR software to recognize scanned pages of long texts while retaining the formatting?


I'm completely new to Textract, and before taking the plunge of learning the API, I wanted to ask: is it possible to use Textract to recognize scanned pages, such as books or scholarly articles, while retaining the character and paragraph formatting, and have it output an RTF or .DOC text file? Many thanks!

asked a year ago, 372 views
2 Answers
Accepted Answer

By formatting, I assume you mean font size and style (e.g. bold, italic)? Currently Textract does not extract information on this type of formatting.

The DetectDocumentText API currently provides the following information (source):

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page
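
As a rough illustration of the fields listed above, the sketch below pulls the LINE blocks out of a Textract response and orders them by page and vertical position. It assumes Python with boto3 and the DetectDocumentText operation; the helper names `lines_in_reading_order` and `detect_lines` are made up for this example, the sample response is hand-written rather than real output, and the sort is a naive heuristic that ignores multi-column layouts.

```python
def lines_in_reading_order(response):
    """Return (page, text) pairs for LINE blocks, sorted top to bottom."""
    lines = [b for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]
    # "Page" may be missing for single-image input, so default to page 1.
    lines.sort(key=lambda b: (b.get("Page", 1), b["Geometry"]["BoundingBox"]["Top"]))
    return [(b.get("Page", 1), b["Text"]) for b in lines]


def detect_lines(image_path):
    """Send one PNG/JPEG image to Textract. Requires AWS credentials."""
    import boto3  # imported lazily so the parsing helper works without it

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return lines_in_reading_order(response)


# Hand-made fragment that mimics the response shape (assumption, not real output).
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Geometry": {"BoundingBox": {"Top": 0.0}}},
        {"BlockType": "LINE", "Text": "Second line", "Page": 1,
         "Geometry": {"BoundingBox": {"Top": 0.30}}},
        {"BlockType": "LINE", "Text": "First line", "Page": 1,
         "Geometry": {"BoundingBox": {"Top": 0.10}}},
        {"BlockType": "WORD", "Text": "First", "Page": 1,
         "Geometry": {"BoundingBox": {"Top": 0.10}}},
    ]
}
print(lines_in_reading_order(sample))
```

Note that nothing in these blocks carries font size, bold, or italic, so any RTF or .DOC output would have to be built from the plain text and positions alone.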

It can also extract tables, forms, and specific information through queries. This page provides a good overview of the output you can expect.
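
For the queries feature, a minimal sketch might look like the following. It assumes Python with boto3 and the AnalyzeDocument operation with the QUERIES feature type; `query_answers` and `ask` are hypothetical helper names, and the sample response is hand-written to show the block relationships rather than copied from real output.

```python
def query_answers(response):
    """Map each query text to the list of answer strings linked to it."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        answers[question] = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                # ANSWER relationships point at QUERY_RESULT blocks by Id.
                answers[question] += [blocks[i]["Text"] for i in rel["Ids"] if i in blocks]
    return answers


def ask(image_path, questions):
    """Run queries against one PNG/JPEG image. Requires AWS credentials."""
    import boto3  # imported lazily so the parsing helper works without it

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["QUERIES"],
            QueriesConfig={"Queries": [{"Text": q} for q in questions]},
        )
    return query_answers(response)


# Hand-made fragment that mimics the response shape (assumption, not real output).
sample_queries = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY", "Query": {"Text": "Who is the author?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "Jane Doe"},
    ]
}
print(query_answers(sample_queries))
```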

AWS
S_Moose
answered a year ago

Thank you very much for your explanation! Given that Textract is highly accurate at recognizing characters, this would be a great feature to add.

answered a year ago
