Is it possible to preserve the layout of a PDF using Textract, and to translate documents with Translate?


Hi, good evening. I would like to ask whether there is a way to preserve the visual structure or layout of a PDF file (whether it is text-only or contains tables) using only the OCR function of the Textract service. I need to translate large quantities of documents that are not always well printed or well digitized. I ran some tests and the text extraction is very precise, and by using Translate I would be able to speed up the work a lot. So I'd like to ask whether there is a way to keep the PDF layout at least partly intact, or whether I can restore it in a second step with some other functions.

Second question: is it possible to translate documents in PDF or Word format with the Translate service?

Thanks in advance for your reply. Btw, happy new year :)

  • Hey, may I ask whether you finally figured it out? I have a similar requirement. Thanks.

Asked 1 year ago · 919 views
1 Answer

Hi. If you want to extract the structure of the document, the best option is the AnalyzeDocument API, which extracts the relationships and structural elements such as tables and key-value pairs. If you only want to use the text detection API (DetectDocumentText), you get the bounding-box coordinates of each WORD or LINE detected, which you can use to reconstruct the document by placing the text in its original position. The coordinates are in the Geometry field; see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html. With this approach you have only the text, with no information about table structure or any other elements that were in the original document.
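To make the bounding-box idea concrete, here is a minimal Python sketch, not an official sample. It assumes boto3 is installed and AWS credentials are configured. The function names `detect_lines` and `line_positions` and the page size of 600 x 800 are made up for illustration, and the `sample` dictionary is a hand-written stand-in for a real response, not actual Textract output. The BoundingBox values are ratios of the page width and height, measured from the top-left corner, so they need to be scaled to the page size you draw on.

```python
from typing import Any, Dict, List, Tuple


def detect_lines(document_bytes: bytes) -> Dict[str, Any]:
    """Call Textract text detection on a single-page document.

    Needs boto3 and AWS credentials; imported here so the rest of
    the sketch can be tried without them.
    """
    import boto3

    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": document_bytes})


def line_positions(
    response: Dict[str, Any], page_width: float, page_height: float
) -> List[Tuple[str, float, float]]:
    """Return (text, x, y) for every LINE block, with a top-left origin."""
    placed = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]  # ratios of the page size
        placed.append(
            (block["Text"], box["Left"] * page_width, box["Top"] * page_height)
        )
    return placed


# Hand-written stand-in for a response, for illustration only.
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {
            "BlockType": "LINE",
            "Text": "Invoice",
            "Geometry": {
                "BoundingBox": {
                    "Left": 0.5, "Top": 0.25, "Width": 0.2, "Height": 0.05
                }
            },
        },
    ]
}

placed = line_positions(sample, page_width=600, page_height=800)
print(placed)
```

Each (text, x, y) tuple could then be passed to whatever PDF or image library you use to draw the (possibly translated) text at that position. As noted above, this recovers text placement only, not tables or other structure.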

Regarding your second question: Textract doesn't do document conversion. We extract text and structure information from the document, but we do not recreate a document similar to the one you sent.

I hope it helps. Happy New Year to you as well :)

AWS
Answered 1 year ago
  • Not being a developer, I find this a bit complicated. May I ask where the JSON code has to go? I thought there was a link where you upload the PDF file to get the OCR output. Thanks for the answer, though :)
