Can Textract be used like traditional OCR software to recognize scanned pages of long texts while retaining the formatting?

I'm completely new to Textract, and before taking the plunge of learning the API, I wanted to ask whether it is possible to use Textract to recognize scanned pages, such as books or scholarly articles, while retaining the character and paragraph formatting, and have it output an RTF or .DOC text file. Many thanks!

Asked a year ago, 372 views
2 Answers
Accepted Answer

By formatting, I assume you mean font size and style (e.g. bold, italic)? Currently Textract does not extract information on this type of formatting.

The DetectDocumentText API currently provides the following information (source):

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page
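
To make the list above concrete, here is a minimal Python sketch of reading that output with boto3. The helper names `lines_by_page` and `detect_lines` are made up for illustration, and sorting by bounding-box position is only a naive reading order that will not handle multi-column pages. Note that nothing in the blocks carries font size or style.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group LINE blocks by page, ordered top-to-bottom, then left-to-right."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        # Assumption: treat a missing Page value as page 1 (single-image input).
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))
    return {
        page: [text for _, _, text in sorted(items)]
        for page, items in sorted(pages.items())
    }


def detect_lines(image_bytes):
    """Run Textract on one image; requires boto3 and AWS credentials."""
    import boto3  # imported here so lines_by_page works without boto3 installed

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_by_page(response)
```

Calling `detect_lines(open("page.png", "rb").read())` (the file name is a placeholder) would return a dictionary mapping page numbers to lists of plain text lines. Any paragraph or style reconstruction would have to be inferred separately, for example from the bounding boxes.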

It can also extract tables, forms, and specific information through queries. This page provides a good overview of the output you can expect.
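
As a rough illustration of the queries feature, the sketch below sends questions through the AnalyzeDocument API and collects the answers. The helper names `query_answers` and `ask` are hypothetical, and the parsing assumes each QUERY block links to its QUERY_RESULT blocks through an ANSWER relationship.

```python
def query_answers(response):
    """Map each query (by alias, else by question text) to its answer texts."""
    blocks = {b["Id"]: b for b in response.get("Blocks", []) if "Id" in b}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        key = query.get("Alias") or query.get("Text", "")
        texts = []
        for rel in block.get("Relationships") or []:
            if rel.get("Type") != "ANSWER":
                continue
            for result_id in rel.get("Ids", []):
                result = blocks.get(result_id)
                if result and "Text" in result:
                    texts.append(result["Text"])
        answers[key] = texts  # empty list when no answer was found
    return answers


def ask(image_bytes, questions):
    """Run Textract queries on one image; requires boto3 and AWS credentials."""
    import boto3  # imported here so query_answers works without boto3 installed

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": [{"Text": q} for q in questions]},
    )
    return query_answers(response)
```

For a scanned article, something like `ask(image_bytes, ["What is the title?"])` would return the matching text, again without any font or style information.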

AWS
S_Moose
Answered a year ago

Thank you very much for your explanation! Given that Textract has very high accuracy in terms of correctly recognizing the characters, this would be a great feature to add.

Answered a year ago
